This publication is a guide for programmers and analysts who have an
interest in producing software that can be multitasked during execution
on Cray Research Computer Systems.


This publication describes the multitasking features and associated 
concepts provided with the Cray Operating System (COS). It describes how
to use the features and how to produce executable programs that generate 
correct results. 

The reader is assumed to be familiar with the contents of the CRAY-OS 
Version 1 Reference Manual, publication SR-0011, and to be experienced in 
coding the Cray FORTRAN (CFT) language as described in the FORTRAN (CFT) 
Reference Manual, publication SR-0009. 

The following Cray Research publications also contain information useful 
to programmers developing multitasking software on CRI computers: 

SR-0000 CAL Assembler Reference Manual, defining the CAL assembly 
language 

SR-0012 Macros and Opdefs Reference Manual, defining standard CAL 
macros 

SR-0014 Library Reference Manual, describing subroutines in default 
libraries 

SG-0056 Symbolic Interactive Debugger (SID) Reference Manual, 
describing SID commands and their use 

SR-0060 Pascal Reference Manual, defining the Pascal language 

SM-0066 Segment Loader (SEGLDR) Reference Manual, describing the 
segment loader 

SN-0220 CRAY-1 Optimization Guide, describing techniques for
improving the performance of CFT code, primarily through
the more effective use of vectorization

The following references discuss multitasking in a more general sense:


Andrews, G., and Schneider, F., "Concepts and Notations for Concurrent
Programming", ACM Computing Surveys, Vol. 15, Number 1 (March 1983),
pp. 3-44.

Brinch Hansen, P., "Concurrent Programming Concepts", ACM Computing
Surveys, Vol. 5, Number 4 (Dec. 1973), pp. 223-245.

Multitasking is a mode of operation that provides for the execution of
two or more parts of a single program in parallel. A job efficiently
multitasked requires less execution time, when measured from start to
finish (wall-clock time) on a dedicated multiprocessor system, than a
job that is not multitasked. The CRAY X-MP Computer System, with
multiple central processor units (CPUs), is the primary motivation for
the development of multitasking system software to run with COS.

To achieve any improvement, a programmer must understand the algorithm 
and structure of the candidate program. Some codes lend themselves to 
multitasking, while others do not. The programmer must decide whether 
possible gains warrant the effort involved in modifying or designing a 
program to use multitasking. 

This publication supplies information to help in this analysis and 
decision. In general, it covers multitasking as implemented at Cray 
Research, Inc., including: 

• Concepts related to multitasking

• Descriptions of features

• Procedures and advice for programmers producing multitasked code
  from existing code

Because multitasking performance improvements are not available on CRAY-1 
Computer Systems, this document assumes that the code is being run on a 
CRAY X-MP System. However, multitasked codes can be run on CRAY-1 
Computer Systems, and this capability is valuable for program development 
and debugging. Appendix A provides information for running multitasked 
codes on CRAY-1 Computer Systems. 


NOTE 

The terminology of multitasking within the computer 
industry is far from standard. Section 2 and the 
glossary contain definitions for the terms as used in 
this publication. 




SN-0222 


1-1 


1.1 MULTITASKING TRADE-OFFS 


Before considering the details of multitasking, we will review the
trade-offs involved with modifying a program for multitasking.

The best theoretical gain that can be achieved from multitasking is that 
a job running on a dedicated system in wall clock time, t, without 
being multitasked could run in wall clock time t/n if modified to use 
n or more parallel tasks on a machine with n CPUs. On a CRAY X-MP, 
with n as 2, the optimum wall clock speedup due to multitasking would 
be a factor of 2. 

In practice, a speedup factor of 2 is not possible. Several factors 
limit the maximum improvement for a given program. 

• Not all parts of a program can be divided into parallel tasks.
  Many algorithms do not have a parallel structure or have only a
  portion that is parallel.

• Those parts that can be multitasked may have dependencies on one
  another that result, at run time, in one or more tasks having to
  wait until others complete some operation. During this wait time,
  the waiting tasks do not contribute to parallel execution.

• Use of the multitasking features incurs a certain amount of
  overhead that is lost to the job. The more these features are
  used, the greater the overhead.

The following sections address these factors in more detail. 

Consideration of these factors is especially important because 
multitasking may increase a program’s wall-clock time if the above 
factors decrease performance more than parallel execution improves it. 
This is a situation we want to be able to predict before investing too 
much time and effort. 
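
As a rough aid to this prediction, the three limiting factors can be folded into a simple estimate. The sketch below is an illustration only, written in Python rather than CFT; the Amdahl-style model and the names `estimated_speedup`, `parallel_fraction`, and `overhead_fraction` are assumptions made for this example and are not part of the Cray Research software.

```python
def estimated_speedup(parallel_fraction, n_cpus, overhead_fraction=0.0):
    """Estimate wall-clock speedup under a simple Amdahl-style model.

    parallel_fraction: share of the original run time that can be
        divided into parallel tasks (0.0 to 1.0).
    n_cpus: number of CPUs available to the job.
    overhead_fraction: multitasking overhead plus task wait time,
        expressed as a share of the original run time (hypothetical
        value that the programmer must measure or guess).
    """
    serial_part = 1.0 - parallel_fraction
    parallel_part = parallel_fraction / n_cpus
    return 1.0 / (serial_part + parallel_part + overhead_fraction)


if __name__ == "__main__":
    # Two CPUs, as on a CRAY X-MP with n = 2.
    for p, ovh in [(1.0, 0.0), (0.9, 0.02), (0.5, 0.05), (0.2, 0.15)]:
        s = estimated_speedup(p, 2, ovh)
        print(f"parallel={p:.2f} overhead={ovh:.2f} speedup={s:.2f}")
```

A result below 1.0 corresponds to the case described above, in which multitasking increases the wall-clock time of the program.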

The implementation of multitasking at Cray Research is aimed at the large 
job running in a dedicated environment. Multitasked programs may be run 
in a batch environment, but improvement in wall-clock time may vary 
greatly from run to run, depending upon other activity in the system. 
Total system throughput may decrease if the increased CPU time used by 
multitasked jobs reduces the time available to other jobs. (A batch job 
that requires all of Central Memory effectively executes in a dedicated 
environment. In such a case, multitasking should be seriously 
considered.) 

Another issue a programmer should consider is that multitasked programs 
may not be as straightforward to test and debug as regular programs. 

When two or more parts of a program are executed at the same time, timing 
errors can arise. These errors may not be reproducible, and limited 
facilities are currently available to help analyze or prevent such timing 
errors. 

Converting a program for multitasking requires more understanding and
analysis than taking advantage of vectorization in CFT. Vectorization,
which can give performance improvements for scalar code at least as good
as multitasking, is performed automatically by the CFT compiler. Code
modifications can often be made to increase the amount of code that can
be vectorized, but these tend to be small changes and localized to inner
DO-loops. The majority of the modifications to vectorized code are
safe: CPU time rarely increases and answers remain correct.

On the other hand, multitasking code is produced by the programmer with 
limited analysis aids. Code modifications tend to involve larger 
segments of code, since multitasking can enhance the performance of outer 
loops containing already vectorized code. The programmer must appreciate
the overhead costs of multitasking and be willing to enforce the rules
necessary for producing correct results.

Multitasking is valuable to certain applications, and programmers should 
consider it as a possible performance enhancement. The ratio of costs to 
benefits should be calculated for each application. 


1.2 MULTITASKING OVERVIEW 

Multitasking has been added to COS and the Cray Research product set as 
an enhancement to existing software. The general implementation scheme 
consists of providing library subroutines, callable from CFT code, that 
interface with COS. COS was modified to support the concept of multiple 
tasks within a single user job. 


1.2.1 COS 

Version 1.13 of COS provides for multitasking within job steps; as 
always, job steps are executed sequentially. Using the library routines, 
a program executing in a job step can create additional tasks, thus 
bringing about multitasking. A multitasked job step is not complete 
until all tasks within the job step complete. 

The following example shows the lifetimes of different tasks for a job 
that builds and runs a program that is partitioned into three tasks. 

Most of the job steps use only one task, but the bulk of the execution 
time probably lies in the program execution. 

                                   Task 1   Task 2   Task 3

JOB,JN=TMULT...                      X
ACCOUNT,AC=...                       X
CFT,ALLOC=STACK...                   X
LDR,AB=MTPROG,NX,...                 X
ACCESS,DN=DATA,PDN=DATA1             X
MTPROG.                              X        X        X
SAVE,DN=OUT,PDN=OUT1,...             X
RELEASE,DN=DATA:OUT.                 X
ACCESS,DN=DATA,PDN=DATA2             X
MTPROG.                              X        X        X
SAVE,DN=OUT,PDN=OUT2,...             X



No CRI software products or utilities have been internally multitasked. 
Successive compilation steps, for example, do not execute in parallel. 


A COS job that is multitasked can run on the same system with jobs that 
are not multitasked. Although the wall-clock time may increase and the 
order of execution may change, a properly multitasked job should see no 
change in results. 


1.2.2 LIBRARIES 

Elementary user-level facilities were implemented as library subroutines 
for ease of development, enhancement, and extension. These features are 
described in section 3. 


1.3 FUTURE 

The primary direction of enhancements to multitasking at Cray Research 
will be improved performance and ease of use. Performance improvements 
will primarily be internal in nature, while ease-of-use enhancements may 
include the following types of changes: 

• Additional constructs in the libraries and/or languages

• Additional debugging aids or tools to help in the analysis of 
potential multitasking programs and in the subsequent debugging of 
multitasked code. 

Some enhancements in these areas are already being planned, but the bulk 
depend heavily on ideas and suggestions of the people who use the initial 
multitasking facilities. 





CONCEPTS 


This section defines the concepts and terminology of multiprocessing as 
they are applied by Cray Research. 


2.1 PARALLELISM 

As used here, parallel refers to the manner in which software processes 
are executed. Jobs, job steps, programs, and parts of programs are 
parallel if they are processed simultaneously (or nearly so) rather than 
sequentially. 

Levels of parallelism are defined in terms of the types of software 
processes that are executed in parallel. 


Level 1    Independent jobs, each job having a CPU
Level 2    Job steps: related parts of the same job
Level 3    Routines and subroutines
Level 4    Loops
Level 5    Statements


The higher the number of the level, the smaller the size or granularity 
of tasks. 


Parallelism is not new to multitasking. With Cray Research Computer 
Systems, parallelism at levels 4 and 5 has been exploited for years. 
Vector processing is the parallel processing of iterations of loops
(level 4). The CFT compiler schedules generated instructions in a manner
that exploits the independence and different speeds of the hardware
functional units; this leads to parallel execution of different
statements (level 5).


2.2 MULTIPROGRAMMING


Multiprogramming is a mode of operation that provides for the sharing
of processor resources among multiple, independent software processes.
Many computer systems use multiprogramming to make the most efficient use 
of a single CPU. In this mode several processes are ready to run, and if 
one process is delayed by I/O, another process is immediately scheduled 
to run on the CPU. In contrast, a system in dedicated mode has only one 
process ready to run, and any delays leave the CPU idle. The processor 
resource can be more than one CPU; each CPU could be shared by several 
software processes. 


Example : 

COS Version 1.11 is a multiprogramming operating system. The 
processor resource is one CPU, and the software processes are jobs. 
Sharing is managed by the Job Scheduler within COS by assigning 
priorities to jobs and allocating CPU time a slice at a time to 
different jobs. Figure 2-1 illustrates this type of multiprogramming. 


Software processes                 Processor resources

   +-------+
   | JOB A |
   +-------+

   +-------+
   | JOB B |                           +-------+
   +-------+                           |       |
                                       |  ONE  |
   +-------+                           |  CPU  |
   | JOB C |                           |       |
   +-------+                           +-------+

   +-------+
   | JOB n |
   +-------+


Figure 2-1. Multiprogramming in COS 

2.3 MULTIPROCESSING


Multiprocessing is a mode of operation that provides for parallel 
processing by two or more processors. That is, the processors are all 
working at the same time without adversely affecting each other. 


Example : 

Under COS Version 1.12, two independent jobs can be run in parallel 
on a CRAY X-MP Computer System. This is sometimes referred to as the 
processors running separate job streams. The job is the scheduling 
unit of the system, and two processors are scheduled in a 
multiprogramming mode. Truly independent jobs won't affect each 
other, but two jobs using the same dataset can interfere with each 
other and thus are not independent. 

This example of independent uniprocessing exploits parallelism at level 
1. System throughput is enhanced over single processor configurations. 
Individual jobs receive no benefits. 

Application of more than one processor to a single job implies that the 
job has software processes (parts) that can be executed in parallel. 

Such a job can be logically or functionally divided in such a way that 
two or more parts of the work can be executed simultaneously (that is, in 
parallel). An example of this is a weather modeling job where the
northern hemisphere calculation is one part and the southern hemisphere 
another part. Another example of a job that can be functionally divided 
is a program having a sort operation on a database that can be run 
independently of a formatting operation on some previously processed data. 

Distinct code segments need not be involved. The same code could run on 
multiple processors at the same time, with each processor acting on 
different data. 


2.4 TASK


A task is a unit of computation that can be scheduled. The 
instructions in a task are processed in sequential order. A task is a 
software process. 

In a single processor multiprogramming operating system such as COS 
Version 1.11, a job is a task. In a multiprocessing environment 
supported by COS Version 1.13, a job is still a task, but it may spin 
off other tasks to run in parallel with it. 

To take advantage of a multiprocessing operating system, a job must be
divided into two or more tasks. That is, for parts of the job to run in
parallel on more than one processor, the parts must be scheduled
separately.

A task is a uniquely named process that can have code and data areas in 
common with (or even identical to) other tasks of the same job. The code 
executed by a task is a subroutine. The same work can be performed by 
calling the subroutine or by starting up a task to execute the 
subroutine. The difference is that the call causes the work to be 
performed immediately; in the task, the work is scheduled and performed 
independently and in parallel with other tasks in the program. 
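
The difference between calling a subroutine and starting a task to execute it can be sketched as follows. This is an illustration in Python, not CFT; Python's `threading` module stands in for the Cray Research library routines, and the names `work` and `run_both` are hypothetical.

```python
import threading


def work(results, key, n):
    """The 'subroutine': sum the integers 1..n and store the answer."""
    results[key] = sum(range(1, n + 1))


def run_both(n=1000):
    results = {}

    # Ordinary call: the work is performed immediately, and the caller
    # does not continue until it is finished.
    work(results, "called", n)

    # Task: the same subroutine is scheduled separately and may execute
    # in parallel with the caller.
    task = threading.Thread(target=work, args=(results, "tasked", n))
    task.start()
    # ... the caller is free to do other work here ...
    task.join()  # wait for the task to complete before using its result

    return results


if __name__ == "__main__":
    print(run_both(100))
```

Both paths produce the same answer; only the scheduling differs.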


NOTE 

The term task in Cray Research publications refers to 
several different types of software entities. Except 
as otherwise indicated, any reference to task in this 
publication uses the definition above, which 
corresponds to the concept of library task in other 
Cray Research publications. 


2.5 MULTITASKING 

Multitasking is a special case of multiprocessing defining a task to be 
a job step or subprogram. Parallelism levels 2 and 3 are multitasking 
modes of operation. At Cray Research, multitasking is currently only 
supported for subprograms (parallelism level 3).

In a multitasking environment, the tasks and data structure of a job must 
be such that the tasks can be run in parallel. There is no guarantee 
that more than one processor will be able to work on the tasks of a given 
job, no guarantee that the tasks will execute in any particular order, 
and no guarantee of which task will finish first. The availability of 
processors and the order of execution and completion of tasks are 
functions of the scheduling policies of the library and operating 
system. Multitasking is nondeterministic with respect to time.

Tasks, however, must be made deterministic with respect to results. The 
key to a successful multitasked program is to precisely define and add 
the necessary communication and synchronization mechanisms between 
parallel tasks and to provide for the protection of shared data. 

The following example is a simple case in which two tasks execute without
interruption on two processors:

Task A  |------------------------------|

Task B  |------------------------------|

        time -->

The next example illustrates a case in which only one processor is 
available, and tasks C and D must share it. Multitasking can be 
performed on machines with one processor. 


[Timeline figure: Task C and Task D share the single processor; each
task waits while the other executes.]

        time -->


In the third example, two tasks share two processors. Note that at 
several points, only one processor is actually in use by the job, and at 
one point, neither is assigned to the job. Note also that there is no 
indication of which physical processor is assigned to which task; this 
assignment is transparent to the program. 


[Timeline figure: Task E executes with a period marked "Task E waits";
Task F executes with a period marked "Task F is interrupted".]

        time -->


2.6 SCOPE 

The scope of a variable is the region of a program in which the 
variable is defined and can be referenced. Outside of a variable's 
scope, the variable is not defined and references to the variable's name 
either refer to another variable of the same name (as in FORTRAN) or are 
treated as an error condition (as in Pascal or CAL) if not otherwise 
declared. 

Each task consists of executable instructions and a well-defined set
of data upon which the instructions act. The set of data corresponding
to a task can be divided into two subsets. One subset is the data
local to the task. That is, it is defined and accessible only by that
task. A second subset consists of data that is common to the task
and at least one other task. That is, the data is defined for and
accessible by other tasks.

A concept related to scope is that of the lifetime of a variable's
contents. CFT, like many other FORTRAN compilers, guarantees local 
variables only for the lifetime of the subroutine containing them. 

Common variables, which are in COMMON blocks, are guaranteed for the 
lifetime of the entire program. In a FORTRAN program that is not 
multitasked, this distinction can often be ignored, since the local 
variables are usually assigned to a fixed location in Central Memory to 
improve performance. In a multitasked program, the location of local 
variables can change and the memory space reused. This makes the 
distinction important to understand and respect. 

All communication between tasks must be done by means of data common to 
those tasks. Data that is to be worked on by more than one task (for 
example, a large array) must also be included in the common data. 

Variables used in the internal functioning of a task (for example, loop 
indices and other control variables) must be included in the locally 
defined data. These variables are defined before they are used in the 
task's code. 

Special care must be taken when multitasking is performed on tasks with 
identical code (that is, the same subroutines are associated with 
different tasks) . Although data can be common to all routines making up 
the code, it may be that it should not be shared by the various tasks. 
This is an example of task common. Confusion can arise from the fact 
that the traditional scope of variables is determined by the division of 
code in program units. When we divide code into tasks, the concept of 
data being common to subroutines may not imply that that data should be 
common to tasks. 


2.7 CRITICAL REGION 

A critical region is a segment of code that accesses a shared 
resource. This resource may be Central Memory, I/O files, subroutines, 
or anything else that is shared by the tasks in a job. (Most examples of 
critical regions in this publication relate generally to shared memory. 
The concepts and techniques apply equally well to any shared resource.) 

For example, indeterminate results can arise from more than one task
simultaneously reading from and storing into shared memory locations. 
Neither task can be sure that the data it is reading is as expected, nor 
that the area of memory stored to is ready to be overwritten. 

Shared memory is memory common to two or more tasks and accessed by 
them when there is a chance they might be running in parallel. Shared 
memory is a subset of common memory (defined earlier). As an example,
consider the following subroutines: 

      SUBROUTINE MTASK1
      COMMON /COMA/ AAA, BBB
      REAL CCC
      AAA = 0.
      (start task MTASK2)
      ...
      AAA = AAA + 1.
      ...
      (wait for completion of MTASK2)
      ...
      BBB = AAA
      ...
      END

      SUBROUTINE MTASK2
      COMMON /COMA/ AAA, BBB
      ...
      AAA = AAA + 1.
      ...
      END

In this example, variable AAA is shared, since MTASK1 and MTASK2 could 
change the variable at the same time. Variable BBB, although it is 
accessible by MTASK2, is not used by MTASK2 and is thus not shared. 
Variable BBB could become shared if MTASK2 were changed to use it. In 
contrast, local variable CCC in MTASK1 could not be used by MTASK2 unless 
both MTASK1 and MTASK2 were changed to make the variable common. 

Critical regions of code must be monitored if the program modules 
containing them are to run in parallel. This monitoring can be done by 
having one code segment set a lock when it enters a critical region. 

This amounts to the task putting up a flag to indicate that the shared 
data is being used. This system works only if all other tasks that could 
possibly be run in parallel check the lock before they enter a 
corresponding critical region. The monitoring operation consists of the 
following steps: 

1. Test to see if the lock is set.

2. If the lock is set, wait until it is cleared and proceed to step 3.
   If the lock is clear, proceed immediately to step 3.

3. Set the lock, execute the critical region, and clear the lock on
   exit from the critical region.

In most implementations of this feature, including the Cray Research 
implementation, a task executing this operation waits in step 2, if the 
lock is set, until another task leaves the critical region and clears the 
lock. 

A program in which all instances of a critical region are successfully 
monitored is said to be implementing mutual exclusion within the critical 
region. In other words, if one task is in the region, all others are 
excluded. 
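
Mutual exclusion of this kind can be sketched in a few lines. The example is written in Python as an illustration only; `threading.Lock` stands in for the Cray Research lock routines, and the function name `run_counter` is hypothetical.

```python
import threading


def run_counter(n_tasks=4, increments=10000):
    """Have several tasks update one shared counter under a lock."""
    counter = {"value": 0}   # shared data, like a variable in COMMON
    lock = threading.Lock()

    def task():
        for _ in range(increments):
            # Entering the 'with' block waits while the lock is set,
            # then sets it; leaving the block clears it.
            with lock:
                counter["value"] += 1   # the critical region

    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_counter())
```

Because every task checks the same lock before touching the counter, no update is lost and the final value is the number of tasks times the number of increments.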

Because a task unable to enter a critical region will be forced to wait, 
it is important to keep the length of critical regions (in execution 
time) to a minimum. This goal must be balanced against the cost of the 
locking operation. A job that has overly large critical regions may see 
numerous cases of tasks waiting for entry; but a job with too many, 
overly small critical regions may see a high overhead penalty. 


Example : 


      SUBROUTINE MTASK1
      DIMENSION A(1000),B(1000)
      COMMON /BLOCK/ J,A,B,N
      INTEGER JLOCAL
      ...
C     BEGIN CRITICAL REGION
      JLOCAL = J + N
      J = JLOCAL
C     END CRITICAL REGION
      DO 10 I = 0,N-1
         A(I+JLOCAL) = B(I+JLOCAL)
   10 CONTINUE
      ...
      END

      SUBROUTINE MTASK2
      DIMENSION A(1000),B(1000)
      COMMON /BLOCK/ J,A,B,N
      INTEGER JLOCAL
      ...
C     BEGIN CRITICAL REGION
      JLOCAL = J + N
      J = JLOCAL
C     END CRITICAL REGION
      DO 10 I = 0,N-1
         A(I+JLOCAL) = B(I+JLOCAL)
   10 CONTINUE
      ...
      END


In the above example, references to J in MTASK1 and MTASK2 are critical
regions and must be monitored to ensure that each loads a different value
of J. In this program, the way that J is used means that arrays A and B,
although common, are not shared.
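
The same pattern, in which a short critical region hands each task a private range of a common array, can be sketched as follows. This is an illustration in Python, not CFT; `threading.Lock` stands in for the lock routines, the name `copy_in_blocks` is hypothetical, and, as a simplifying variation on the example above, each task here takes the old value of the index before advancing it.

```python
import threading


def copy_in_blocks(b, n, n_tasks):
    """Copy list b into a new list, n elements per task."""
    a = [None] * len(b)
    state = {"j": 0}          # next unclaimed index (plays the role of J)
    lock = threading.Lock()

    def task():
        # Critical region: claim a block of n indices.
        with lock:
            jlocal = state["j"]
            state["j"] = jlocal + n
        # Outside the critical region: each task works on its own block,
        # so the arrays are common to the tasks but not shared.
        for i in range(jlocal, min(jlocal + n, len(b))):
            a[i] = b[i]

    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


if __name__ == "__main__":
    print(copy_in_blocks(list(range(12)), 4, 3))
```

Only the two statements that read and update the index are locked; the copy loop, which is the bulk of the work, runs without any locking.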


2.8 REENTRANCY 

Reentrancy is a property of a program module that allows one copy of 
the module to be used by more than one task in parallel. A mechanism is 
supplied in which the routine’s local environment is recreated each time 
the routine executes. That is, local variables and control indicators 
are assigned independent storage locations each time the routine is 
called. 

Not all program modules need be coded in a reentrant manner. A module 
can be used only once during the lifetime of the program, or it can be in 
a critical region that ensures it is used by, at most, one task at a time.
The former is an example of nonreentrant code, the latter of serially 
reusable code. The locking operation described previously can be used 
to ensure and enforce the general property of serial reusability: to 
ensure that such code is executed by only one task at a time. 

This can be done in three ways. If the design of the program is such 
that no attempt is ever made to reenter it, then no special treatment is 
needed. If the entire subroutine is nonreentrant (as with CFT code
compiled with ALLOC=STATIC; see section 3), all calls to the subroutine
must be treated as critical regions and must be locked. If the entry
sequence is reentrant (as with CFT code compiled with ALLOC=STACK), any
nonreentrant parts of the subroutine can be locked within the subroutine.


Examples : 

(1) Subroutine SERIAL is totally nonreentrant (compiled with
ALLOC=STATIC).

      SUBROUTINE MTASK
      ...  (declarations)
      ...  (code)
      CALL LOCKON(LSERIAL)
      CALL SERIAL
      CALL LOCKOFF(LSERIAL)
      ...  (code)
      RETURN
      END

(2) Subroutine SERIAL has a reentrant entry sequence (compiled with
ALLOC=STACK).

      SUBROUTINE SERIAL
      ...  (declarations, no code)
      CALL LOCKON(LSERIAL)
      ...  (code)
      CALL LOCKOFF(LSERIAL)
      RETURN
      END

Regardless of the reentrancy of a program module, any critical regions
within it must still be monitored and locked. For example, consider the
following two modules (both compiled with ALLOC=STACK):

      SUBROUTINE SERIAL
      ...  (declarations, no code)
      CALL LOCKON(LSERIAL)
      ...  (code)
      CALL LOCKON(LCRIT1)
      ...  (critical region)
      CALL LOCKOFF(LCRIT1)
      ...  (code)
      CALL LOCKOFF(LSERIAL)
      RETURN
      END



      SUBROUTINE PARALLEL
      ...  (declarations)
      CALL LOCKON(LCRIT1)
      ...  (critical region)
      CALL LOCKOFF(LCRIT1)
      RETURN
      END

Even though SERIAL is serially reusable, it must separately protect the 
critical region with LCRIT1, because PARALLEL might be executing at the 
same time. LSERIAL can be used to protect both the critical region and 
the subroutine SERIAL, but this may have the disadvantage of increasing 
the size of the critical region. (PARALLEL would be locked out for the 
entire time SERIAL is executing, not just the time SERIAL is inside the 
critical region.) 

The Cray Research implementation of multitasking uses a stack mechanism 
to support reentrancy. A stack is a data structure providing a 
dynamic, sequential data list having special provisions for access from 
one end. A last in, first out (push down, pop up) access mechanism is 
used. Each time a reentrant subroutine is entered, a stackframe is 
allocated in memory for its local variables. This stackframe is then 
local to this instance of the subroutine. 
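
The distinction between reentrant and serially reusable code can be sketched as follows. The example is an illustration in Python, not CFT: Python allocates local variables afresh for every call, much as a stackframe does, while the module-level `_scratch` area plays the role of statically allocated storage. All names are hypothetical.

```python
import threading


def reentrant_sum(values):
    """Reentrant: 'total' is local, so every call has its own copy."""
    total = 0
    for v in values:
        total += v
    return total


_scratch = {"total": 0}            # static storage shared by all calls
_scratch_lock = threading.Lock()


def serially_reusable_sum(values):
    """Serially reusable: the lock admits one task at a time."""
    with _scratch_lock:
        _scratch["total"] = 0
        for v in values:
            _scratch["total"] += v
        return _scratch["total"]


def run_in_tasks(func, inputs):
    """Run func on each input in its own task and collect the results."""
    results = [None] * len(inputs)

    def task(k):
        results[k] = func(inputs[k])

    threads = [threading.Thread(target=task, args=(k,))
               for k in range(len(inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    data = [[1, 2, 3], [10, 20]]
    print(run_in_tasks(reentrant_sum, data))
    print(run_in_tasks(serially_reusable_sum, data))
```

Both routines return correct results, but calls to the second are forced to execute one after another, while calls to the first may overlap.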


Example:

Routine A is reentrant.

Task 0   |--- Routine A ---|

Task 1        |--- Routine A ---|

         time -->

Routine B is serially reusable.

Task 0   |--- Routine B ---|

Task 1                      |--- Routine B ---|

         time -->

2.9 LOAD BALANCING


Load balancing is a technique that ensures the amount of work done by 
each of the processors involved in a job is approximately equal. When 
load balancing is achieved, all work that can be done in parallel is 
divided evenly among processors. There are two types of load balancing: 
static and dynamic. 

Static load balancing can be realized if the amount of work involved in 
each piece of a job can be determined ahead of time. Parallel tasks are 
then defined to each run a similar (preferably equal) amount of time. 
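
When the cost of each piece is known ahead of time, the pieces can be assigned to tasks before the run. The sketch below is an illustration in Python; the greedy largest-first rule and the name `assign_static` are assumptions made for this example, not a method prescribed by this publication.

```python
def assign_static(costs, n_tasks):
    """Assign pieces to tasks so that the task loads are roughly equal.

    costs: known cost of each piece, indexed by piece number (from 0).
    Returns (assignment, loads): the piece numbers given to each task
    and the total cost carried by each task.
    """
    loads = [0] * n_tasks
    assignment = [[] for _ in range(n_tasks)]
    # Largest pieces first; each goes to the currently lightest task.
    order = sorted(range(len(costs)), key=lambda k: costs[k], reverse=True)
    for piece in order:
        t = loads.index(min(loads))
        assignment[t].append(piece)
        loads[t] += costs[piece]
    return assignment, loads


if __name__ == "__main__":
    # Four pieces with hypothetical costs 4, 3, 2, and 1.
    print(assign_static([4, 3, 2, 1], 2))
```

With the hypothetical costs shown, the first and last pieces go to one task and the middle two to the other, giving each task the same load.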

Dynamic load balancing is done for a program whose pieces have unknown 
workloads. Since predicting the amount of time a given piece will 
require is not possible, tasks should be constructed to keep themselves 
busy by looking for and executing the next piece of work. 
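
A task that keeps itself busy by looking for the next piece of work can be sketched with a shared work queue. The example is an illustration in Python, not CFT; `queue.Queue` and `threading.Thread` stand in for the Cray Research facilities, and the name `run_dynamic` is hypothetical.

```python
import queue
import threading


def run_dynamic(pieces, n_tasks=2):
    """Let n_tasks tasks pull pieces of work from a common queue.

    Returns, for each task, the list of pieces it processed.
    """
    work = queue.Queue()
    for piece in pieces:
        work.put(piece)
    done = [[] for _ in range(n_tasks)]

    def task(task_id):
        while True:
            try:
                piece = work.get_nowait()   # look for the next piece
            except queue.Empty:
                return                      # no work left: task completes
            done[task_id].append(piece)     # stand-in for the real work

    threads = [threading.Thread(target=task, args=(k,))
               for k in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    print(run_dynamic(["piece 1", "piece 2", "piece 3", "piece 4"]))
```

Which task processes which piece can differ from run to run, but every piece is processed exactly once.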

If all the work involved in a job can be done in parallel on n 
processors and the load is balanced among them, the wall clock time for 
the multitasking run can approach 1/n of the wall-clock time for the
job run on one processor. 


Example:

One task (serial code):

   Piece 1   Piece 2   Piece 3   Piece 4

   time -->

Two tasks (partially balanced code):

   Piece 1   Piece 3

   Piece 2   Piece 4

   time -->

Two tasks (better balanced code):

   Piece 1   Piece 4

   Piece 2   Piece 3

   time -->

2.10 SYNCHRONIZATION


Synchronization, as used in multitasking, is the method of coordinating 
the steps within processes that can be run in parallel. This 
coordination ensures that initial conditions for a process are met or 
that output from a process is ready to be used. 

A synchronization point is a point in time when a task receives the 
go-ahead to proceed with its processing. That is, whatever the task is 
awaiting has happened, and a signal has been sent to and received by the 
waiting task. 

The Cray Research multitasking implementation provides two
synchronization mechanisms:

• Events, which provide a general way of signaling the occurrence of
  some programmer-defined event. Tasks can wait for events, post
  events (that others may be waiting for), or clear events (reset
  them).

• A task can wait for another task to complete execution. This
  could be viewed as a higher level function built with the event
  mechanism (where the event is a task completion and is posted by
  the system).

Synchronization of tasks works only if all tasks perform their respective 
parts of the required communication. One task must signal the important 
occurrence; another task or tasks must wait for the signal, receive it, 
and clear the signaling device. 
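
The wait, post, and clear operations can be sketched as follows. This is an illustration in Python, not CFT; `threading.Event` stands in for the Cray Research event mechanism (its `set` method corresponds to posting), and the name `producer_consumer` is hypothetical.

```python
import threading


def producer_consumer():
    """One task posts an event; another waits for it and clears it."""
    data_ready = threading.Event()
    shared = {}
    received = []

    def producer():
        shared["value"] = 42       # prepare the data first
        data_ready.set()           # post the event

    def consumer():
        data_ready.wait()          # wait for the event to be posted
        received.append(shared["value"])
        data_ready.clear()         # clear (reset) the event for reuse

    c = threading.Thread(target=consumer)
    p = threading.Thread(target=producer)
    c.start()
    p.start()
    c.join()
    p.join()
    return received, data_ready.is_set()


if __name__ == "__main__":
    print(producer_consumer())
```

The consumer reads the data only after the event has been posted, whichever of the two tasks happens to be scheduled first.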


Example : 

Two tasks:

   Task 0   | W -- ( ) P
   Task 1   | P -( -- ) W

   W   Wait for event occurrence
   P   Post event occurrence
   (   Request to enter critical region
   )   Leave critical region


This example shows two tasks using events and critical regions. The 
total run time of each is increased by the periods each loses due to the 
synchronization and locking mismatches. The load balancing technique 
discussed above should take such possible synchronization delays into 
account. 

2.11 DEADLOCK


Deadlock is a condition in which locks and synchronization mechanisms 
have been misused to the point where a task is waiting for something to 
happen that can never happen.

As a simple case, consider the following incorrect code segment: 

      DO 10 I=1,N
         CALL LOCKON(LOCK1)
         ...
   10 CONTINUE
      CALL LOCKOFF(LOCK1)

A task executing this code will successfully lock LOCK1 in the first 
iteration but will wait forever in the second iteration. Obviously, the 
programmer intended the call to LOCKOFF to be within the loop. 

A more frequently encountered form of deadlock is when two tasks wait for 
each other to complete some action. As an example, consider two tasks 
that each use two locks. If the order in which the nested locks are set 
is different, each task might successfully set one lock and wait for the 
other lock to be reset. Such a situation may not occur in every run, 
since it is tied to the timing of the two tasks. 

A deadlock need not initially involve all tasks in the job. Even if only 
a subset of tasks deadlock initially, the other tasks will either 
complete or will wait themselves. Eventually all active tasks in the job 
will be deadlocked. 

Deadlock detection is recognizing a deadlock situation after the 
deadlock has occurred. Deadlock prevention requires conventions or 
rules to ensure that deadlock does not occur. For example, a programmer 
can define a rule that any task needing more than one lock must set the 
locks in alphabetical order. 

This prevents deadlock, although at the possible cost of enlarging a 
critical region. Deadlock detection is a function of the system 
software, and deadlock prevention is generally a responsibility of the 
user. 
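
The alphabetical-order rule can be sketched as follows. The subroutine names UPDAB and UPDBA, the COMMON block /LOCKS/, and the lock names are hypothetical and serve only as illustration; both locks are assumed to have been set up elsewhere. Both subroutines set LOCKA before LOCKB, even though the second one needs the data guarded by LOCKB first:

```fortran
      SUBROUTINE UPDAB
      INTEGER LOCKA,LOCKB
      COMMON /LOCKS/ LOCKA,LOCKB
C     BOTH LOCKS ARE NEEDED: SET THEM IN ALPHABETICAL ORDER
      CALL LOCKON (LOCKA)
      CALL LOCKON (LOCKB)
C     ... USE DATA GUARDED BY LOCKA, THEN DATA GUARDED BY LOCKB ...
      CALL LOCKOFF (LOCKB)
      CALL LOCKOFF (LOCKA)
      RETURN
      END
C
      SUBROUTINE UPDBA
      INTEGER LOCKA,LOCKB
      COMMON /LOCKS/ LOCKA,LOCKB
C     DATA GUARDED BY LOCKB IS NEEDED FIRST, BUT LOCKA IS STILL
C     SET FIRST TO HONOR THE ORDERING RULE
      CALL LOCKON (LOCKA)
      CALL LOCKON (LOCKB)
C     ... USE DATA GUARDED BY LOCKB, THEN DATA GUARDED BY LOCKA ...
      CALL LOCKOFF (LOCKB)
      CALL LOCKOFF (LOCKA)
      RETURN
      END
```

In UPDBA, LOCKA is held longer than strictly necessary. This is the enlarged critical region that the rule can cost.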


2.12 MULTITASKING EXAMPLE 

This section describes, at a high level, the design of a program that 
uses the multitasking features described above. In the next section, 
this example is expanded. 


2.12.1 GENERAL APPLICATION 


The example program starts by reading data a unit at a time from an input 
dataset. Each unit may represent, for example, data recorded at some 
particular time, and the input dataset may be a sequence of such 
readings. Each unit of data is processed and a corresponding output unit 
written. These operations are repeated until the input dataset is 
completely read. 


2.12.2 INITIAL TASK 

The main program, or initial task, performs the following operations: 

1. Initializes control variables 

2. Starts a task to write the output unit 

3. Enters main loop 

• Reads unit of data (if end of data reached, leave loop) 

• Performs some preprocessing that cannot be multitasked 

• Starts two tasks, each of which processes half of the data using 
the same code for execution 

• Waits for the tasks to complete 

• Performs some post-processing that cannot be multitasked 

• Waits for the output task to write the previous unit of data 
(or, in the case of the first time through the loop, to 
initialize itself) 

• Signals the output task to write the next unit of data 

4. Performs next iteration of the loop 

5. Waits for the output task to write the last unit of data 

6. Signals the output task that all data has been supplied 


2.12.3 OUTPUT TASK 


The output task allows output of one set of data to occur in parallel 
with processing of the next set of data. The only synchronization 
necessary is to ensure that one set is written before the next set is 
supplied. 

The output task performs the following operations: 

1. Initializes control variables, opens the dataset 

2. Begins main loop 

• Signals ready condition 

• Waits for the initial task to supply data (if signal says all 
data has been supplied, leaves loop) 

• Outputs data 

3. Performs next iteration of loop 
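
The handshake between the initial task and the output task can be sketched with the event and task routines described in section 3. This is only an illustrative outline, not the implementation of the example: the names EVREADY, EVDATA, IDONE, OUTARY, OUTTSK, and the COMMON block /OUTSYN/ are hypothetical, GETUNIT is an assumed user routine that reads one unit and sets IEOF nonzero at end of data, and the processing tasks are omitted.

```fortran
      PROGRAM MULTI
      INTEGER EVREADY,EVDATA,IDONE,OUTARY(2)
      COMMON /OUTSYN/ EVREADY,EVDATA,IDONE
      EXTERNAL OUTTSK
C
      CALL EVASGN (EVREADY)
      CALL EVASGN (EVDATA)
      IDONE=0
      OUTARY(1)=2
      CALL TSKSTART (OUTARY,OUTTSK)
C
   10 CONTINUE
C     GETUNIT IS A HYPOTHETICAL USER ROUTINE
      CALL GETUNIT (IEOF)
      IF (IEOF .NE. 0) GO TO 20
C     ... PREPROCESS, RUN THE PROCESSING TASKS, POSTPROCESS ...
C     WAIT UNTIL THE PREVIOUS UNIT HAS BEEN WRITTEN
      CALL EVWAIT (EVREADY)
      CALL EVCLEAR (EVREADY)
C     ... MOVE RESULTS TO THE SHARED OUTPUT AREA ...
      CALL EVPOST (EVDATA)
      GO TO 10
C
C     END OF DATA: WAIT FOR THE LAST WRITE, THEN SIGNAL ALL DONE
   20 CONTINUE
      CALL EVWAIT (EVREADY)
      CALL EVCLEAR (EVREADY)
      IDONE=1
      CALL EVPOST (EVDATA)
      CALL TSKWAIT (OUTARY)
      END
C
      SUBROUTINE OUTTSK
      INTEGER EVREADY,EVDATA,IDONE
      COMMON /OUTSYN/ EVREADY,EVDATA,IDONE
C     ... OPEN THE OUTPUT DATASET ...
   10 CONTINUE
C     SIGNAL READY, THEN WAIT FOR DATA OR THE ALL-DONE SIGNAL
      CALL EVPOST (EVREADY)
      CALL EVWAIT (EVDATA)
      CALL EVCLEAR (EVDATA)
      IF (IDONE .NE. 0) RETURN
C     ... WRITE THE OUTPUT UNIT ...
      GO TO 10
      END
```

Each event is cleared by the task that waited on it, so every post is matched by exactly one wait. The final TSKWAIT is an addition that keeps the initial task from ending before the output task has returned.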


2.12.4 PROCESSING TASKS 

The processing tasks are two copies of the same module, each processing 
half of a set of data. This single set of instructions is shared by the 
two tasks that the initial task creates, each of which processes a 
different half of the data. 

The shared module has the following operations: 

1. Initializes control variables 

2. Processes data 


3 FEATURE DESCRIPTIONS 


Multitasking features have been implemented in library subroutines. 

The feature descriptions and examples are all from the point of view of a 
CFT programmer. Section 9 provides information on multitasking in other 
languages. 


3.1 UNDERLYING ASSUMPTIONS 


This section covers various assumptions and warnings about the user 
program and the environment in which it is used. 


3.1.1 CRI SOFTWARE 

The multitasking implementation is part of the 1.13 COS and CFT release, 
and it assumes that all other system software is of that release or 
beyond. The CFT 1.13 compiler generates code only for the calling 
sequence introduced with the CFT 1.11 release. 

The libraries used with multitasked programs must be created with the 
multitasking assembly option. The multitasking assembly option generates 
the libraries with the following features: 

• Stack option of the calling sequence introduced with the CFT 1.11 
release 

• Nonreentrant I/O subroutines in $IOLIB protected with locks 

• Multitasking library subroutines enabled for multiple tasks 

The default libraries at most sites are built with the new calling 
sequence and should not be used with multitasked programs. 


3.1.2 OVERLAYS AND SEGMENTS 


The multitasking features provide no explicit support for LDR overlays or 
for SEGLDR segments. Nothing prevents them from being used together, but 
the serial nature of loading segments and overlays conflicts with the 
parallel nature of multitasking. The programmer must take care if a 
combination of multitasking and overlays (or segments) is attempted. 


3.1.3 EXTENDING BLANK COMMON 

Some FORTRAN programs use a nonstandard technique known as extending 
blank common to provide dynamic memory management within user space. The 
use of stacks sometimes requires expansion of memory in user space; this 
is discussed later. Only one of these two memory extension mechanisms 
can be in control in a job. 

If the user wishes to extend blank common, the stack area can be set to a 
fixed size and placed below blank common. Attempts to expand the stack 
then cause a job to abort. 

If blank common need not be expanded, the stack can be placed above it 
and be allowed to expand. Attempts to expand blank common cause 
unpredictable results. LDR and SEGLDR parameters allow the programmer to 
specify the proper configuration. 

As an alternative to extending blank common, programmers should consider 
using user-callable heap allocation routines (such as HPALLOC and 
HPDEALLC). The stack management routines use these heap allocation 
routines, allowing both stacks and user code to share the same memory 
area. The subroutines are described in the Library Reference Manual, CRI 
publication SR-0014. 


3.1.4 CFT OPTIMIZATION 

The CFT compiler generates heavily optimized code. These optimizations 
are usually transparent to the programmer in a program run in 
nonmultitasked mode. In a multitasking environment, however, many of 
these optimizations may cause surprises. These optimizations include: 

• Reluctance to store variables to memory. Memory loads and stores 
are expensive, and CFT tries to retain variables in registers (A, 
S, B, T, or V) as long as possible. There are two exceptions: 
variables in COMMON are stored before any subroutine call, and all 
variables are stored when a subroutine completes execution. 

• Reordering statements. The compiler may reorder statements when 
the order does not appear to modify the results. Reordering is 
intended to remove invariant statements from loops or to make the 
best use of gaps resulting from the different timings of hardware 
instructions (instruction scheduling). Again, these optimizations 
are analyzed from a viewpoint other than that of multitasking. 

• Temporary storage. If an expression is included in an argument 
list, CFT calculates and stores the value of the expression in a 
temporary storage area. If the argument is passed to another 
task, the storage location may be reused by the initial task 
before the second task completes. 

The following code segment reflects some of these problems. Assume the 
code that called MYSUB had LOCKI, LOCKJ, I, and J safely stored in COMMON 
but decided to pass them as arguments to MYSUB because of the software 
design. Assume further that actual code is being executed within the DO 
loop, but none of the code uses I and J except what is shown below. 

      SUBROUTINE MYSUB (LOCKI,LOCKJ,I,J) 
      EXTERNAL NEWSUB 
      INTEGER ITCA(2) 
C 
      ITCA(1)=2 
C 
      DO 10 II=1,10 
         CALL LOCKON (LOCKI) 
         I=I+1 
         CALL LOCKOFF (LOCKI) 
C 
         CALL LOCKON (LOCKJ) 
         J=0 
         CALL LOCKOFF (LOCKJ) 
   10 CONTINUE 
C 
      CALL TSKSTART (ITCA,NEWSUB,3*I) 
C 
      END 

With the reentrant CFT compiler, the generated code executes as follows: 

      SUBROUTINE MYSUB (LOCKI,LOCKJ,I,J) 
      EXTERNAL NEWSUB 
      INTEGER ITCA(2)                   (allocated on stack) 
C 
      ITCA(1)=2 
      J=0                               (moved out of loop) 
C 
                                        (reg T00 = I) 
      DO 10 II=1,10 
         CALL LOCKON (LOCKI) 
                                        (reg T00 = reg T00+1) 
         CALL LOCKOFF (LOCKI) 
C 
         CALL LOCKON (LOCKJ) 
         CALL LOCKOFF (LOCKJ) 
   10 CONTINUE 
                                        (I = reg T00) 
C 
                                        (compute 3*I and put on stack) 
                                        (pass addresses of ITCA, NEWSUB, 
                                        and (3*I)) 
      CALL TSKSTART (ITCA,NEWSUB,3*I) 
C 
      END                               (and possibly complete task execution) 

In this example, the handling of I and J could be corrected if the two 
are specified in a COMMON block within MYSUB rather than being passed as 
arguments. The (3*I) problem could be worked around by setting a 
variable in COMMON to 3*I and passing that variable as the parameter or 
by forcing the task executing MYSUB to wait until the completion of the 
task executing NEWSUB. 

Although these problems can be worked around, the fact remains that the 
code looked good and would work correctly in a single-tasked environment 
but could fail in a multitasked environment. 
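
The workarounds described above might be combined as in the following sketch. The COMMON block name /SHARED/ and the variable I3 are hypothetical, and it is assumed that only one task executes MYSUB at a time (otherwise I3 would itself need protection). Waiting with TSKWAIT is shown in addition to the COMMON variable; it also keeps the stack-allocated ITCA valid until the started task has finished.

```fortran
      SUBROUTINE MYSUB
      EXTERNAL NEWSUB
      INTEGER ITCA(2)
      INTEGER LOCKI,LOCKJ,I,J,I3
C     SHARED DATA IS NAMED IN COMMON HERE, NOT PASSED AS ARGUMENTS
      COMMON /SHARED/ LOCKI,LOCKJ,I,J,I3
C
      ITCA(1)=2
C
      DO 10 II=1,10
         CALL LOCKON (LOCKI)
         I=I+1
         CALL LOCKOFF (LOCKI)
C
         CALL LOCKON (LOCKJ)
         J=0
         CALL LOCKOFF (LOCKJ)
   10 CONTINUE
C
C     STORE THE EXPRESSION IN A COMMON VARIABLE AND PASS THAT
      CALL LOCKON (LOCKI)
      I3=3*I
      CALL LOCKOFF (LOCKI)
      CALL TSKSTART (ITCA,NEWSUB,I3)
C
C     DO NOT RETURN UNTIL THE TASK EXECUTING NEWSUB HAS COMPLETED
      CALL TSKWAIT (ITCA)
C
      END
```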


3.1.5 REPRIEVE PROCESSING 

Multitasked programs can use reprieve processing, but care must be taken. 
First, consider the two types of reprieve conditions: user-caused and 
environmental. 

For a user-caused condition (such as operand range error or 
floating-point error), each task that can cause the error must issue a 
SETRPV request for the condition or conditions of interest. When a 
condition occurs, the reprieve code receives control from the task that 
caused the error; any other tasks are suspended. 

For an environmental condition (such as interactive attention), each task 
that can be in execution when the condition arises must issue a SETRPV 
request for the condition or conditions of interest. When a condition 
occurs, the reprieve code receives control from one of the running tasks; 
any other tasks are suspended. If one task omits the SETRPV and happens 
to be executing when the condition occurs, then the error condition is 
not reprieved and the job aborts. 

In either case, the reprieve code should execute either a CONTRPV or an 
ENDRPV request upon completion of the reprieve processing. If CONTRPV is 
executed, all tasks are resumed if possible. If ENDRPV is executed, 
termination of the job step completes. 


3.2 PARALLELISM 

The basic parallelism features deal with tasks. 


3.2.1 TASKS 

A task running under COS consists of code and data that can be scheduled 
for execution on a CPU. A job step can have any number of tasks, all of 
which are assigned the same priority and memory characteristics as the 
job itself. 

A task is defined as starting execution at a CFT entry point (typically a 
subroutine), and it can call other subroutines during its execution. A 
task completes when it executes a RETURN statement in the subroutine 
where it began execution, when it executes a STOP statement or equivalent 
operation, or when its execution is aborted due to some error condition. 
When it ends by a RETURN, STOP, END, or CALL EXIT, only that task ends. 
When it terminates for an error condition or CALL ABORT, all other tasks 
are stopped as soon as possible. 

Any program executed under COS has an initial, root task created by the 
system. This task suffices for nonmultitasked programs and products, and 
these codes do not require modification to run on a system supporting 
multitasking. Multitasking within a program occurs as soon as the 
program explicitly creates another task through a call to TSKSTART. 


3.2.2 TASK STATES 


The library routines described in this section view tasks as being in one 
of two states: existing or not existing. A task exists from the time it 
is created to the time it completes execution. Among nonexistent tasks, 
no distinction is made between a task that has never existed and a task 
that has completed execution. 


3.2.3 TASK RELATIONSHIPS 

No default relationships exist between tasks. Decisions to use specific 
intertask relationships (such as coroutines or a parent-child 
relationship) are made and enforced by the programmer. 


3.2.4 TASK CONTROL ARRAY 

Each user-created task is represented by an integer task control array. 
The array must be built by the user program. At a minimum, the array 
must be two words in length. A third word is optional. At a later date, 
additional words can be defined, but these too are optional, thereby 
allowing existing code to continue to run. The current array structure 
is as follows: 



LENGTH      Word 1. Length in words of the array. The length must be 
            set to a value of 2 or 3, depending on the optional use of 
            the TASK VALUE field. The LENGTH field is set by the user 
            before creating the task. 

TASK ID     Word 2. Task identifier assigned by the multitasking 
            library when a task is created. This identifier is unique 
            among active tasks within the job step. This field is used 
            by the multitasking library for task identification but is 
            of limited use to user programs. 

TASK VALUE  Word 3 (optional). Value that can be set to any value by 
            the user before creating the task. If TASK VALUE is used, 
            LENGTH must be set to a value of 3. The task value can be 
            used for any purpose. Suggested values include a task name 
            or identifier generated by the programmer or a pointer to 
            a task local storage area. During execution, a task can 
            retrieve this value with the TSKVALUE subroutine. 


Example: 

      PROGRAM MULTI 
      INTEGER TASKARY(3) 
C 
C     SET TASKARY PARAMETERS 
      TASKARY(1)=3 
      TASKARY(3)='TASK 1' 
C 
      END 


3.2.5 TSKSTART 


TSKSTART initiates a task and is invoked using the following format: 


CALL TSKSTART (taskarray,name[,list]) 


taskarray   Task control array used for this task. Word 1 must be 
            set. Word 3, if used, must also be set. On return, word 2 
            will be set to a unique task identifier that must not be 
            changed by the program. 

name        External entry point at which task execution begins. This 
            name must be declared EXTERNAL in the program or subroutine 
            making the call to TSKSTART. CFT does not allow a 
            subroutine to use its own name in this parameter. 

list        Optional list of arguments passed to the new task. This 
            list can be of any length. 

******************************************************* 

CAUTION 

Arguments passed in list are passed by address to the 
newly started task. As a result, the arguments become 
shared data whose subsequent use by different tasks 
must be synchronized. 

Do not pass expressions as arguments in list. CFT 
will store the computed expression on the stack at 
run time and can reuse the storage any time after 
TSKSTART returns to the calling task (even though the 
started task may not have executed). 

******************************************************* 


A call to TSKSTART is identical to CALL name[(list)], except that 
name is executed as a task instead of a subroutine. 


Example: 

      PROGRAM MULTI 
      INTEGER TASK1ARY(3),TASK2ARY(3) 
      EXTERNAL PLLEL 
      REAL DATA(40000) 
C 
C     LOAD DATA ARRAY FROM SOME OUTSIDE SOURCE 
C 
C     ... 
C 
C     CREATE TASK TO EXECUTE FIRST HALF OF THE DATA 
      TASK1ARY(1)=3 
      TASK1ARY(3)='TASK 1' 
C 
      CALL TSKSTART (TASK1ARY,PLLEL,DATA(1),20000) 
C 
C     CREATE TASK TO EXECUTE SECOND HALF OF THE DATA 
      TASK2ARY(1)=3 
      TASK2ARY(3)='TASK 2' 
C 
      CALL TSKSTART (TASK2ARY,PLLEL,DATA(20001),20000) 
C 
      END 


3.2.6 TSKWAIT 


TSKWAIT waits for the indicated task to complete execution. It is 
invoked using the following format: 


CALL TSKWAIT (taskarray) 


taskarray Task control array 


Example: 

      PROGRAM MULTI 
      INTEGER TASK1ARY(3),TASK2ARY(3) 
      EXTERNAL PLLEL 
      REAL DATA(40000) 
C 
C     LOAD DATA ARRAY FROM SOME OUTSIDE SOURCE 
C 
C     ... 
C 
C     CREATE TASK TO EXECUTE FIRST HALF OF THE DATA 
      TASK1ARY(1)=3 
      TASK1ARY(3)='TASK 1' 
C 
      CALL TSKSTART (TASK1ARY,PLLEL,DATA(1)) 
C 
C     CREATE TASK TO EXECUTE SECOND HALF OF THE DATA 
      TASK2ARY(1)=3 
      TASK2ARY(3)='TASK 2' 
C 
      CALL TSKSTART (TASK2ARY,PLLEL,DATA(20001)) 
C 
C     NOW WAIT FOR BOTH TO FINISH 
      CALL TSKWAIT (TASK1ARY) 
      CALL TSKWAIT (TASK2ARY) 
C 
C     AND PERFORM SOME POST-EXECUTION CLEANUP 
C 
      END 

NOTE 


In the above example, TSKSTART is called once for each 
of the two tasks. Alternatively, the second TSKSTART 
could be replaced by a call to PLLEL and the 
corresponding TSKWAIT removed. This alternative 
approach reduces the overhead of the additional task, 
but it can make understanding the program structure 
more difficult. 

The two approaches produce the same results. 
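
The alternative mentioned in the note can be sketched as follows, reusing the names from the example above. The initial task processes the second half of the data itself, so only one task is created and only one TSKWAIT is needed:

```fortran
      PROGRAM MULTI
      INTEGER TASK1ARY(3)
      EXTERNAL PLLEL
      REAL DATA(40000)
C
C     ... LOAD DATA ARRAY FROM SOME OUTSIDE SOURCE ...
C
C     CREATE ONE TASK TO EXECUTE FIRST HALF OF THE DATA
      TASK1ARY(1)=3
      TASK1ARY(3)='TASK 1'
      CALL TSKSTART (TASK1ARY,PLLEL,DATA(1))
C
C     EXECUTE SECOND HALF OF THE DATA IN THE INITIAL TASK ITSELF
      CALL PLLEL (DATA(20001))
C
C     WAIT ONLY FOR THE ONE STARTED TASK
      CALL TSKWAIT (TASK1ARY)
C
C     AND PERFORM SOME POST-EXECUTION CLEANUP
C
      END
```

One consequence of this approach is that a copy of PLLEL running in the initial task receives a zero from TSKVALUE, so a subroutine that relies on the task value to tell its copies apart needs to allow for that case.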


3.2.7 TSKVALUE 

TSKVALUE retrieves the user identifier (if any) specified in the task 
control array used to create the executing task. TSKVALUE is invoked 
using the following form: 


CALL TSKVALUE (return) 


return      Value held in word 3 of the task control array. A zero is 
            returned if the array length was less than 3 or if the task 
            is the initial, root task (described earlier in this 
            section). 


Example: 

      SUBROUTINE PLLEL (DATA,SIZE) 
      INTEGER SIZE 
      REAL DATA(SIZE) 
C 
C     DETERMINE WHICH OUTPUT FILE TO USE 
      CALL TSKVALUE (IVALUE) 
      IF (IVALUE .EQ. 'TASK 1') THEN 
         IUNITNO = 3 
      ELSEIF (IVALUE .EQ. 'TASK 2') THEN 
         IUNITNO = 4 
      ELSE 
C        ERROR CONDITION; DON'T CONTINUE. 
         STOP 
      ENDIF 
C 
      END 


3.3 SCOPES AND PROTECTION 


This section discusses CFT implementation of common and local data and 
the lock library feature. 


3.3.1 COMMON DATA IN CFT 

Multitasked programs written in CFT should keep common data in COMMON 
blocks. Space for such data is statically allocated at load time and the 
variable contents are guaranteed over the life of the program. When a 
subroutine is called, any common variables maintained in registers are 
stored into memory before the call. (Local data can remain in registers.) 


******************************************************* 

CAUTION 

Common variables should not be passed as parameters 
between subroutines in a multitasking system. These 
variables, even if they are located in COMMON blocks in 
a calling subroutine, are not treated as common 
variables in the called subroutine, and CFT may perform 
optimizations that lead to unexpected problems. The 
following example illustrates these problems: 

      SUBROUTINE TEST (A,LOCKA,I) 
      INTEGER LOCKA 
      REAL A(20000) 
C 
      CALL LOCKON (LOCKA) 
      A(I+1)=A(I)*2 
      CALL LOCKOFF (LOCKA) 
C 
      RETURN 
      END 

Neither LOCKA nor A (both shared) is in a COMMON block 
in the compilation of this subroutine. CFT considers 
moving the assignment statement outside of the 
subroutine calls to be safe, since it cannot detect any 
dependence between them. Further, CFT is free to 
copy LOCKA to local storage for this subroutine and is 
obligated to store it to memory only on the RETURN. As 
a result, the calls to LOCKON and LOCKOFF may 
reference a memory location on the stack for TEST and 
not the memory location of the original variable. 

******************************************************* 
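
One way to avoid the problem shown in the caution is to name the shared data in a COMMON block inside the called subroutine. The following is a sketch only: the block name /CBA/ is hypothetical, the calling code is assumed to declare the same block, and I is assumed to be private to the calling task.

```fortran
      SUBROUTINE TEST (I)
      INTEGER LOCKA
      REAL A(20000)
C     A AND LOCKA ARE NOW COMMON VARIABLES IN THIS COMPILATION
      COMMON /CBA/ LOCKA,A
C
      CALL LOCKON (LOCKA)
      A(I+1)=A(I)*2
      CALL LOCKOFF (LOCKA)
C
      RETURN
      END
```

Because A and LOCKA are in COMMON, CFT stores any register copies to memory before each subroutine call, and LOCKON and LOCKOFF reference the original lock variable.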


3.3.2 LOCAL DATA IN CFT 


Multitasked programs written in CFT should keep local data in a stack. 
This is accomplished by compiling these programs with the following 
parameter on the CFT control statement: 

CFT,ALLOC=STACK,... 

With this statement, CFT allocates and accesses variables from a stack 
specific to the subroutine invocation. In this manner, local variables 
are truly local to the task executing the subroutine and do not conflict 
with any other task executing the subroutine. 

The programmer must also ensure that the default and user libraries used 
have been compiled or assembled to use a stack. 

Using a stack introduces some additional overhead to subroutine 
linkages. This overhead results from the need to allocate the stack 
space at run time. The stack management routines may be required to make 
expensive system requests to obtain more space. LDR and SEGLDR support 
several parameters that can reduce the possibility and number of such 
calls at the expense of obtaining memory space before it is needed. (See 
the "Tuning" heading later in this section for more information.) 

Not all code needs to have its local variables allocated on a stack. For 
example, subroutines that are nonreentrant or serially reusable can be 
compiled with ALLOC=STATIC, the default setting. Code with statically 
allocated local variables can be combined with stack-allocated local 
variables in a user program, but this should be done carefully. If two 
tasks use a subroutine with statically allocated local variables at the 
same time, unpredictable results and intermittent errors can occur. It 
is recommended that a program be completely compiled and tested with 
stack-allocated local variables before making the effort to introduce 
modules recompiled with statically allocated local variables. 


******************************************************* 

CAUTION 

Using SAVE or DATA statements causes the referenced 
variables to be assigned to static memory locations, 
regardless of the setting of the ALLOC parameter. CFT 
generates a warning message for each variable in a SAVE 
or DATA statement if ALLOC=STACK is in effect. 

******************************************************* 


3.3.3 LOCKS 


Locks are the facility for monitoring critical regions of code. The 
operation of locks follows the general description provided in section 2. 

A lock is represented by an integer variable. Lock variables should be 
kept in COMMON blocks to ensure their residency in Central Memory over 
subroutine calls. Henceforth, the terms lock and lock variable will 
be used interchangeably. 


Example: 

      PROGRAM MULTI 
      INTEGER LKINPUT,LKOUTPUT,LKCALL 
      REAL INDATA(20000),OUTDATA(20000) 
      COMMON /CBINPUT/ LKINPUT,INDATA 
      COMMON /CBOUTPUT/ LKOUTPUT,OUTDATA 
      COMMON /MISC/ LKCALL 
C 
      END 

In this example, two locks have been placed in the same COMMON blocks as 
the data to which they correspond. This is not necessary but makes the 
program easier to understand. 


3.3.4 LOCKASGN 

LOCKASGN identifies an integer variable that the program intends to use 
as a lock. The LOCKASGN subroutine must be called for each lock variable 
before its use with any of the other lock subroutines. A lock is given 
an initial state of off or cleared. The LOCKASGN subroutine is invoked 
using the following format: 


CALL LOCKASGN (name) 


name        Name of an integer variable to be used as a lock. The 
            library stores an identifier into this variable. The 
            user must not modify the variable after the call to 
            LOCKASGN until it is released by a call to LOCKREL. 

Example: 

      PROGRAM MULTI 
      INTEGER LKINPUT,LKOUTPUT,LKCALL 
      REAL INDATA(20000),OUTDATA(20000) 
      COMMON /CBINPUT/ LKINPUT,INDATA 
      COMMON /CBOUTPUT/ LKOUTPUT,OUTDATA 
      COMMON /MISC/ LKCALL 
C 
      CALL LOCKASGN (LKINPUT) 
      CALL LOCKASGN (LKOUTPUT) 
      CALL LOCKASGN (LKCALL) 
C 
      END 


******************************************************* 

CAUTION 

If a lock variable is not assigned before use in any of 
the other lock subroutines, the results are 
unpredictable. One common symptom is an abort with an 
ERROR EXIT and a P address of zero. 

******************************************************* 


3.3.5 LOCKON 

LOCKON sets a lock and returns control to the calling task. If the lock 
is already set, the task is suspended until the lock is cleared by 
another task. In either case, the lock is set by the task when it next 
resumes execution of user code. This means that placing LOCKON before a 
critical region can ensure that the code in that region will be executed 
only when the task has unique access to the lock. LOCKON is invoked 
using the following format: 


CALL LOCKON (name) 


name Name of an integer variable used as a lock 


3.3.6 LOCKOFF 


LOCKOFF clears a lock and returns control to the calling task. The act 
of clearing the lock may allow another task to resume execution, but this 
is transparent to the task calling LOCKOFF. LOCKOFF is invoked using the 
following format: 


CALL LOCKOFF (name) 


name Name of an integer variable used as a lock 


Example: 

      PROGRAM MULTI 
      INTEGER LKOUTPUT 
      REAL OUTDATA(20000) 
      COMMON /CBOUTPUT/ LKOUTPUT,OUTDATA 
C 
      CALL LOCKASGN (LKOUTPUT) 
C 
C     ... 
C 
C     REPLACE NEGATIVE VALUES WITH ZERO WHILE HOLDING THE LOCK 
      CALL LOCKON (LKOUTPUT) 
      DO 100 I=1,20000 
         OUTDATA(I)=MAX(OUTDATA(I),0.0) 
  100 CONTINUE 
      CALL LOCKOFF (LKOUTPUT) 
C 
C     ... 
C 
      END 
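
A lock is often used to protect only a brief update of shared data. The following sketch is illustrative: the subroutine SUMPART, the COMMON block /CBTOTAL/, and its variables are hypothetical names, and it is assumed that LKTOTAL has been assigned with LOCKASGN and TOTAL set to zero before any task is started.

```fortran
      SUBROUTINE SUMPART (DATA,SIZE)
      INTEGER SIZE,LKTOTAL
      REAL DATA(SIZE),TOTAL,PART
      COMMON /CBTOTAL/ LKTOTAL,TOTAL
C
C     ACCUMULATE A PRIVATE PARTIAL SUM; NO LOCK IS NEEDED HERE
C     (PART IS LOCAL TO THE TASK ONLY IF COMPILED WITH ALLOC=STACK)
      PART=0.0
      DO 10 K=1,SIZE
         PART=PART+DATA(K)
   10 CONTINUE
C
C     ONLY THE UPDATE OF THE SHARED TOTAL IS A CRITICAL REGION
      CALL LOCKON (LKTOTAL)
      TOTAL=TOTAL+PART
      CALL LOCKOFF (LKTOTAL)
C
      RETURN
      END
```

Keeping the loop outside the locked region lets several tasks sum their portions in parallel and serializes only the final addition.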


3.3.7 LOCKREL 

LOCKREL releases the identifier assigned to the lock. If a task is 
waiting for the lock, an error results. This subroutine is useful 
primarily to detect errors that can arise when a task is waiting on a 
lock that will never be cleared. The lock variable can be reused 
following another call to LOCKASGN. LOCKREL is invoked using the 
following format: 


CALL LOCKREL (name) 


name Name of an integer variable used as a lock 


Example: 

      PROGRAM MULTI 
      INTEGER LKOUTPUT 
      REAL INDATA(20000),OUTDATA(20000) 
      COMMON /CBOUTPUT/ LKOUTPUT,OUTDATA 
C 
      CALL LOCKASGN (LKOUTPUT) 
C 
C     ... 
C 
      CALL LOCKON (LKOUTPUT) 
      DO 100 I=1,20000 
         OUTDATA(I)=MAX(OUTDATA(I),0.0) 
  100 CONTINUE 
      CALL LOCKOFF (LKOUTPUT) 
C 
C     ... 
C 
      CALL LOCKREL (LKOUTPUT) 
C 
      END 


3.4 SYNCHRONIZATION 


This section discusses the subroutines supporting events. 


3.4.1 EVENTS 

Events allow signaling between tasks. An event has two states: cleared 
and posted. 

An event is represented by an integer variable. Event variables should 
be kept in COMMON blocks to ensure their residency in memory over 
subroutine calls. Henceforth, the terms event and event variable 
will be used interchangeably. 

Example: 

      PROGRAM MULTI 
      INTEGER EVSTART,EVDONE 
      COMMON /EVENTS/ EVSTART,EVDONE 
C 
      END 


3.4.2 EVASGN 

EVASGN identifies an integer variable that the program intends to use as 
an event. This subroutine must be called for an event variable before 
that variable is used with any of the other event subroutines. The 
initial state of the event is cleared. EVASGN is invoked using the 
following format: 


CALL EVASGN (name) 


name        Name of an integer variable to be used as an event. The 
            library stores an identifier into this variable. The user 
            must not modify the variable after the call to EVASGN, 
            unless the variable is released by a call to EVREL. 


Example: 

      PROGRAM MULTI 
      INTEGER EVSTART,EVDONE 
      COMMON /EVENTS/ EVSTART,EVDONE 
C 
      CALL EVASGN (EVSTART) 
      CALL EVASGN (EVDONE) 
C 
      END 

******************************************************* 

CAUTION 

If an event variable is not assigned before it is used 
in any of the other event subroutines, the results are 
unpredictable. One common symptom is an abort with an 
ERROR EXIT and a P address of zero. 

******************************************************* 


3.4.3 EVWAIT 


EVWAIT waits until the specified event is posted. If the event is 
already posted, the task resumes execution without waiting. EVWAIT does 
not change the state of the event. EVWAIT is invoked using the following 
format: 


CALL EVWAIT (name) 


name Name of an integer variable used as an event 


Example: 

      SUBROUTINE MULTI2 
      INTEGER EVSTART,EVDONE 
      COMMON /EVENTS/ EVSTART,EVDONE 
C 
      CALL EVWAIT (EVSTART) 
C 
      END 


3.4.4 EVPOST 

EVPOST posts an event and returns control to the calling task. Posting 
an event allows all other tasks waiting on that event to resume 
execution, but whether or not this happens is transparent to the task 
calling EVPOST. Posting a posted event has no effect (posts are not 
queued) and should be avoided. EVPOST is invoked using the following 
format: 


CALL EVPOST (name) 


name Name of an integer variable used as an event 

Example: 

      PROGRAM MULTI 
      INTEGER EVSTART,EVDONE 
      COMMON /EVENTS/ EVSTART,EVDONE 
C     ... 
      CALL EVASGN (EVSTART) 
      CALL EVASGN (EVDONE) 
C     ... 
      CALL EVPOST (EVSTART) 
      END 


3.4.5 EVCLEAR 

EVCLEAR clears an event and returns control to the calling task. This 
function causes tasks subsequently performing EVWAIT calls to wait; if 
not cleared, the posted condition remains outstanding. When a single 
event post is required (a simple signal), EVCLEAR should be called 
immediately after EVWAIT to note that the posting of the event has been 
detected. EVCLEAR is invoked using the following format: 


CALL EVCLEAR (name) 


name Name of an integer variable used as an event 


Example: 

      SUBROUTINE MULTI2 
      INTEGER EVSTART,EVDONE 
      COMMON /EVENTS/ EVSTART,EVDONE 
C 
      CALL EVWAIT (EVSTART) 
      CALL EVCLEAR (EVSTART) 
C 
      END 


3.4.6 EVREL 


EVREL releases the identifier assigned to an event. If a task is waiting 
for the event, an error results. This subroutine is useful primarily to 
detect erroneous uses of an event outside the region the program has 
planned for it. The event variable can be reused following another call 
to EVASGN. EVREL is invoked using the following format: 


CALL EVREL (name) 


name Name of an integer variable used as an event 


Example: 

      PROGRAM MULTI 
      INTEGER EVSTART,EVDONE 
      COMMON /EVENTS/ EVSTART,EVDONE 
C     ... 
      CALL EVASGN (EVSTART) 
      CALL EVASGN (EVDONE) 
C 
      CALL EVPOST (EVSTART) 
C 
C     EVSTART WILL NOT BE USED FROM NOW ON 
      CALL EVREL (EVSTART) 
C 
      END 
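
The event routines can be seen working together in the following sketch of a complete event life cycle. It reuses the EVSTART and EVDONE names from the examples above; the task control array ITCA and the use of TSKWAIT before the release are assumptions made for illustration, chosen so that no task can still be waiting on either event when EVREL is called.

```fortran
      PROGRAM MULTI
      INTEGER EVSTART,EVDONE,ITCA(2)
      COMMON /EVENTS/ EVSTART,EVDONE
      EXTERNAL MULTI2
C
      CALL EVASGN (EVSTART)
      CALL EVASGN (EVDONE)
      ITCA(1)=2
      CALL TSKSTART (ITCA,MULTI2)
C
C     ... PREPARE THE SHARED DATA, THEN SIGNAL THE TASK ...
      CALL EVPOST (EVSTART)
C
C     WAIT FOR THE TASK TO REPORT THAT ITS WORK IS DONE
      CALL EVWAIT (EVDONE)
      CALL EVCLEAR (EVDONE)
C
C     ONCE MULTI2 HAS ENDED, NO TASK CAN BE WAITING ON THE EVENTS
      CALL TSKWAIT (ITCA)
      CALL EVREL (EVSTART)
      CALL EVREL (EVDONE)
      END
C
      SUBROUTINE MULTI2
      INTEGER EVSTART,EVDONE
      COMMON /EVENTS/ EVSTART,EVDONE
C
      CALL EVWAIT (EVSTART)
      CALL EVCLEAR (EVSTART)
C     ... WORK ON THE SHARED DATA ...
      CALL EVPOST (EVDONE)
      RETURN
      END
```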


3.5 TUNING 


The multitasking system software design allows tuning by the user without 
the need to rebuild libraries or other system software. The tuning is 
performed through a library routine that can be called within user code 
as many times and in as many locations as necessary. 


3.5.1 TSKTUNE 

TSKTUNE modifies tuning parameters within the library scheduler. Each 
parameter has a default setting within the library and can be modified at 
any time to another valid setting. 

This routine should not be used when multitasking on a CRAY-1 Computer 
System. 

The effects of this routine may not be measurable in a batch environment 
due to variable conditions between and during runs. TSKTUNE is invoked 
using the following format: 


CALL TSKTUNE (keyword1,value1,keyword2,value2,...) 


keyword i   An ASCII character string, as follows: 

            'MAXCPU'    Maximum number of COS logical CPUs allowed 
                        for the job. For a CRAY X-MP, the initial 
                        value is 2; for a CRAY-1, it is 1. The 
                        value of MAXCPU may range from 1 to a site 
                        installation parameter (IQMAXNUT) limiting 
                        the number of tasks in the system. 

            'DBRELEAS'  Deadband for release of logical CPUs. If 
                        more logical CPUs are allocated to the job 
                        than there are tasks, the deadband reflects 
                        the maximum number that are retained. Any 
                        in excess of this number are released to the 
                        system, requiring an exchange to COS. The 
                        initial value is 1. The value of DBRELEAS 
                        may range from 0 (representing immediate 
                        return) to the value of MAXCPU. 

            'DBACTIVE'  Deadband for activation or acquisition of 
                        logical CPUs. This is the number of 
                        additional user tasks that can be readied 
                        for execution before an additional logical 
                        CPU is activated or acquired. This allows a 
                        queue of tasks to be built before another 
                        CPU is acquired to process these tasks. The 
                        value of DBACTIVE may range from 0 (the 
                        number of logical CPUs is equal to the 
                        number of user tasks, limited by MAXCPU) to 
                        the largest integer value. The initial 
                        value is 0. 

            'HOLDTIME'  Number of clock periods to hold a logical 
                        CPU while waiting for tasks to become ready 
                        and before releasing the CPU to the 
                        operating system. This parameter allows a 
                        user to hold additional logical CPUs in a 
                        job when executing a nonmultitasked section 
                        of code and to have these CPUs quickly 
                        available when the program reenters a 
                        multitasked mode. The value of HOLDTIME may 
                        range from 0 (return logical CPUs 
                        immediately to the operating system) to the 
                        largest integer value. The initial value is 
                        100,000 clock periods. 

            'SAMPLE'    Number of clock periods between checks of 
                        the ready queue. This parameter is used 
                        with the HOLDTIME parameter. SAMPLE adjusts 
                        the frequency of sampling the ready queue 
                        when a logical CPU is waiting for a task to 
                        become ready. If the ready queue is sampled 
                        too often, excess memory contention may 
                        result. The value of SAMPLE may range from 
                        0 (sample the ready queue as often as 
                        possible) to the largest integer value (the 
                        ready queue is effectively never sampled and 
                        ready tasks are never executed by the 
                        waiting processor). The initial value is 
                        500 clock periods. 


value i     An integer 


The parameters must be specified in pairs, but the pairs may be listed in 
any order. 
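
Because TSKTUNE may not validate the values it receives (see the caution
in this section), a program can check them before the call. The following
Python sketch only restates the ranges listed above; it is a model, not
library code. The names check_tsktune, site_limit (a stand-in for the
site installation limit), and current_maxcpu are hypothetical, and
checking DBRELEAS against a MAXCPU given in the same call is an
assumption.

```python
# Illustrative model only: restates the documented TSKTUNE ranges.
# check_tsktune, site_limit and current_maxcpu are hypothetical names.

def check_tsktune(pairs, site_limit, current_maxcpu):
    """Return a list of error strings for keyword/value pairs.

    pairs          -- dict such as {"MAXCPU": 2, "HOLDTIME": 0}
    site_limit     -- assumed stand-in for the site installation limit
    current_maxcpu -- MAXCPU in effect if the call does not change it
    """
    errors = []
    # Assumption: DBRELEAS is checked against a MAXCPU set in the same call.
    maxcpu = pairs.get("MAXCPU", current_maxcpu)
    if not isinstance(maxcpu, int):
        maxcpu = current_maxcpu
    for keyword, value in pairs.items():
        if not isinstance(value, int):
            errors.append(f"{keyword}: value must be an integer")
        elif keyword == "MAXCPU":
            if not 1 <= value <= site_limit:
                errors.append(f"MAXCPU: {value} not in 1..{site_limit}")
        elif keyword == "DBRELEAS":
            if not 0 <= value <= maxcpu:
                errors.append(f"DBRELEAS: {value} not in 0..{maxcpu}")
        elif keyword in ("DBACTIVE", "HOLDTIME", "SAMPLE"):
            if value < 0:
                errors.append(f"{keyword}: {value} is negative")
        else:
            errors.append(f"{keyword}: unknown keyword")
    return errors
```

An empty list means every pair is within its documented range; the
keyword/value pairs can then be passed to TSKTUNE in any order.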


******************************************************* 

CAUTION 

In the COS 1.13 release, TSKTUNE does not check that a 
value passed to it is within its specified range. 

******************************************************* 

The tuning parameters are closely related. In general, the settings in a 
dedicated environment should be different from those in a batch 
environment. In a dedicated environment, the deadbands and the loop 
counts can be high. In general, wasted CPU cycles cost less than 
constantly returning to COS to change the number of tasks. In a batch 
environment, the deadbands and loop counts should be low (the default 
settings), since the wasted CPU time could degrade total system 
throughput. In either case, COS accounting charges the job for any time 
spent idling in an unused CPU. 

The value of MAXCPU in a dedicated environment should be set to either 
the number of physical CPUs or the number of physical CPUs plus one. The 
latter can improve performance in a design with one task performing I/O 
and the remaining tasks performing computations. In a batch environment, 
the value should be kept low but is of less consequence. 


Examples: 

      CALL TSKTUNE ('DBACTIVE', 1) 

      CALL TSKTUNE ('HOLDTIME', 0, 'MAXCPU', 1) 


The first example keeps one more user task than logical CPUs. The second 
cuts back to one logical CPU as soon as possible. 


3.5.2 LDR AND SEGLDR MEMORY MANAGEMENT TUNINGS 

LDR and SEGLDR include parameters and directives dealing with memory 
stacks and heaps. These parameters are described in the COS Reference 
Manual, CRI publication SR-0011, for LDR and in the SEGLDR Reference 
Manual, CRI publication SM-0066, for SEGLDR. Table 3-1 summarizes the 
options. 


                  Table 3-1. Summary of Options 

   Function                      LDR Parameter    SEGLDR Directive 

   Define initial stack size     STK              STACK 
   and increment 

   Define initial heap size      MM               HEAP 
   and increment 

   Define minimum size of a      MMEPS            HEAP 
   free heap block 


The following examples show settings that should provide enough space for 
a multitasked program with a few tasks and a moderate depth of subroutine 
use. The examples include increments in case of stack overflow. Optimal 
initial settings are, however, job and application dependent. 


LDR parameters: 

      LDR,MM=15000:5000,STK=5000:1000,... 

SEGLDR directives: 

      HEAP=15000+5000 
      STACK=5000+1000 

While developing and debugging a multitasked code, optimal initial and 
increment settings should be determined if possible. If space is not a 
concern, slight performance improvements can be had by calculating or 
determining the high water mark for required space for the stack and 
heap. This may not be easy because requirements can change from run to 
run due to different execution sequences of the tasks. (For the heap, 
the subprograms IHPSTAT and HPDUMP can be called from a user program to 
obtain current heap statistics.) 

If parameters are set to meet the maximum storage requirements, the 
allocation of the memory space occurs at load time rather than at run 
time. Some slack can be built into the initial request or the increment 
setting can be set to nonzero to catch cases requiring additional space. 
If the increment is set to zero, an attempted expansion or overflow 
aborts the job. 
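
The trade-off between the initial size and the increment can be worked
out with simple arithmetic. The Python sketch below is a model under the
assumption that each run-time expansion adds exactly one increment; the
function name expansions_needed and the sample high water mark are
hypothetical.

```python
# Illustrative model: assumes each expansion adds exactly `increment` words.

def expansions_needed(initial, increment, high_water):
    """Number of run-time expansions needed to reach high_water words.

    Returns 0 if the initial size is sufficient, and None if growth is
    needed but the increment is zero (the case in which the job aborts).
    """
    if high_water <= initial:
        return 0
    if increment == 0:
        return None
    shortfall = high_water - initial
    # Integer ceiling of shortfall / increment.
    return (shortfall + increment - 1) // increment
```

With the stack settings shown earlier (5000 initial, 1000 increment), a
task mix that peaks at 7200 words would need three expansions, while an
initial size of 7200 or more would need none.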

The size of the minimum heap block is not likely to have a noticeable 
effect on performance. It prevents over-fragmentation of the heap 
manager's free space queue, which could lengthen search times for new 
blocks. This would be a problem only if a large amount of dynamic and 
varying use is made of the heap manager. 


NOTE 

As mentioned earlier, the use of multitasking with 
either LDR overlays or SEGLDR segments is not 
specifically supported. The above parameters apply to 
normal loads with LDR or nonsegmented loads with SEGLDR. 


3.6 MULTITASKING EXAMPLE 


This section expands on the example introduced in section 2 by providing 
CFT code that implements the multitasking parts of the application. The 
processing parts of the code are indicated by CFT comments only. 


3.6.1 COS JCL 

The following is JCL that could be used to run the example as a COS job. 

      JOB,JN=MTEXAMP,... 
      ACCOUNT,... 
      MULTI.                              Access multitasking/stack libraries 
      CFT,ALLOC=STACK.                    Code in $IN 
      LDR,MM=15000:5000,STK=5000:1000.    Allow stack and heap to grow as 
                                          needed 


3.6.2 INITIAL TASK 

The code for the main task is as follows: 

      PROGRAM EXAMPLE
C
      EXTERNAL OUTPUT, PROCESS
C
      INTEGER OUTTCA(2), PR1TCA(2), PR2TCA(2)
C
      COMMON /TCAS/ OUTTCA, PR1TCA, PR2TCA
C
      INTEGER DATAOUT, OUTDONE
      COMMON /EVENTS/ DATAOUT, OUTDONE
      INTEGER DATALOC
      COMMON /OUTDATA/ DATALOC
C
      REAL DATAFILE(100000)
      COMMON /DATASET/ DATAFILE, START1, START2
      DATA START1, START2 /1, 50001/
C
C     ... (open input dataset)
C
C     Initialize control variables
C
      CALL EVASGN (DATAOUT)
      CALL EVASGN (OUTDONE)
C
      OUTTCA(1) = 2
      PR1TCA(1) = 2
      PR2TCA(1) = 2
C
C     Start a task to write the output unit
C
      CALL TSKSTART (OUTTCA, OUTPUT)
C
C     Enter main loop
C
  100 CONTINUE
C
C     Read unit of data (if end of data reached, leave loop)
C
C     ... (read from input dataset to DATAFILE, END=1000)
C
C     Perform some preprocessing that cannot be multitasked
C
C
C     Start two tasks, each of which processes 1/2 the data
C
      CALL TSKSTART (PR1TCA, PROCESS, START1)
      CALL TSKSTART (PR2TCA, PROCESS, START2)
C
C     Wait for the tasks to complete
C
      CALL TSKWAIT (PR1TCA)
      CALL TSKWAIT (PR2TCA)
C
C     Perform some post-processing that cannot be multitasked
C
C     Wait for the output task to finish previous I/O
C
      CALL EVWAIT (OUTDONE)
      CALL EVCLEAR (OUTDONE)
C
C     Signal the output task to write the next unit of data
C
      DATALOC = ...
      CALL EVPOST (DATAOUT)
C
C     Perform next iteration of the loop
C
      GOTO 100
C
 1000 CONTINUE
C
C     Wait for the output task to write previous unit of data
C
      CALL EVWAIT (OUTDONE)
C
C     Signal the output task that all data has been supplied
C
      DATALOC = -99999
      CALL EVPOST (DATAOUT)
C
      END


3.6.3 OUTPUT TASK 


The following task performs the output operations: 

      SUBROUTINE OUTPUT
C
      INTEGER DATAOUT, OUTDONE
      COMMON /EVENTS/ DATAOUT, OUTDONE
C
      INTEGER DATALOC
      COMMON /OUTDATA/ DATALOC
C
C     Initialize control variables, open the dataset
C
C     Begin main loop
C
  100 CONTINUE
C
C     Signal ready condition
C
      CALL EVPOST (OUTDONE)
C
C     Wait for the initial task to supply data
C
      CALL EVWAIT (DATAOUT)
      IF (DATALOC .LT. 0) GOTO 1000
      CALL EVCLEAR (DATAOUT)
C
C     Output data (based on DATALOC)
C
C     Perform next iteration of loop
C
      GOTO 100
C
 1000 CONTINUE
C
      RETURN
      END


3.6.4 PROCESSING TASK 


The following task performs the actual data processing: 

      SUBROUTINE PROCESS (IPTR)
C
      REAL DATAFILE(100000)
      COMMON /DATASET/ DATAFILE
C
C     Initialize control variables
C
C     Process data (based on IPTR)
C
      RETURN
      END


4 DESIGN DESCRIPTIONS 


This section presents an overview of the design of the multitasking 
library subroutines and provides detail on the library subroutines. 


4.1 LIBRARY SCHEDULER 


The multitasking library takes primary responsibility for managing and 
scheduling tasks within a program. This approach offers advantages in 
the following areas: 

• Performance. Multiple operations can be performed at the library 
  level without requiring calls to COS. 

• Tunability. The library can be tuned for individual user 
  programs, and it can be tuned differently for different programs 
  running at the same time. 

• Flexibility. Making library changes is easier than making COS 
  changes. 

• Ease of use. The user is not required to maintain queues or task 
  information and need not program in CAL to use the hardware 
  multitasking features. 


4.1.1 LOGICAL CPU 

The key concept in the COS interface to the library scheduler is that of 
the logical CPU. (The logical CPU is referred to in COS documentation as 
a user task.) A logical CPU is the entity scheduled by COS for execution 
on physical CPUs and is identified as an entry in the COS Task Execution 
Table (TXT). 

Initially, COS assigns a job one logical CPU. But the library scheduler 
can request additional logical CPUs for a particular job, thereby 
bringing about multitasking at the COS level. The number of logical CPUs 
need not, however, equal the number of tasks active in the user job. The 
maximum number of logical CPUs is a major tuning component of the library 
scheduler (discussed in section 3). 

The job of the library scheduler, therefore, is to connect user tasks to 
logical CPUs in the most efficient manner. If a task must wait for a 
lock or an event, that task is disconnected from its logical CPU. The 
logical CPU is then freed for use by another task in the job or possibly 
for return to the system. 

In many multitasking applications, the concept of the logical CPU may 
seem redundant or an unnecessary detail. For example, if only two tasks 
are active on a CRAY X-MP Computer System, two logical CPUs will likely 
be allocated and used by the library scheduler. The concept becomes 
important when more tasks are defined than physical CPUs available (on a 
CRAY X-MP or a CRAY-1 Computer System). Generally, the number of logical 
CPUs should not be greater than the number of physical CPUs to allow task 
scheduling to take place at the higher (and faster) level. 


4.1.2 QUEUE MANAGEMENT 

The library scheduler manages several queues of tasks. Tasks are moved 
between queues as their states change through the use of the multitasking 
facilities. 

The queues are generally handled in first-in, first-out (FIFO) order. 
However, when a task calls any of the multitasking subroutines, it is 
placed in the front of the Waiting for Logical CPU queue. If a task is 
moved from one queue to another, it is placed at the end of the new 
queue. The following discussion describes only two queues: the Waiting 
for Logical CPU queue and the Suspended queue. (Multiple Suspended 
queues are actually implemented.) 

A number of the library subroutines exit through the library scheduler. 
The scheduler selects the first task in the Waiting for Logical CPU 
queue. If the task selected is the one that made the library call, that 
task is already connected to a logical CPU. The scheduler simply returns 
and execution resumes in the user program. If a task other than the one 
that made the call is selected, registers for the task are loaded and 
execution resumes in the new task. 
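
The queue discipline described above can be modeled in a few lines. The
Python sketch below is a toy model, not the library's implementation: it
covers only the Waiting for Logical CPU queue, and the class and method
names (ReadyQueue, library_call, make_ready, schedule) are hypothetical.

```python
from collections import deque

class ReadyQueue:
    """Toy model of the Waiting for Logical CPU queue."""

    def __init__(self):
        self.waiting = deque()

    def library_call(self, caller):
        # A task that calls a multitasking subroutine goes to the front.
        self.waiting.appendleft(caller)

    def make_ready(self, task):
        # A task moved here from another queue goes to the end.
        self.waiting.append(task)

    def schedule(self, caller):
        """Select the first task and report whether a switch occurred.

        If the selected task is the caller, it is already connected to a
        logical CPU and execution simply resumes (no switch).
        """
        selected = self.waiting.popleft()
        return selected, selected != caller
```

In this model a caller that remains ready is always reselected ahead of
tasks that were readied earlier, which is why a library call by itself
does not force a task switch.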


NOTE 

B and T registers are explicitly loaded by the library 
scheduler; A, S, and V registers are loaded as needed 
within the user code. Because the multitasking 
features are implemented as library subroutines, CFT 
and CAL programmers cannot assume the contents of A, S, 
and V registers are preserved across the call or that 
the generated code or the CAL code performs its own 
reloading as required. 


4.2 KEY LIBRARY SUBROUTINES 


The following sections describe the key library subroutines in detail. 


4.2.1 TSKSTART 

TSKSTART builds a stack for the task. It copies initial information from 
the task control array into a Task Information Block in the base of the 
stack. The task is then placed in the Waiting for Logical CPU queue, and 
control passes to the library scheduler. 


4.2.2 TSKWAIT 

TSKWAIT checks the status of the specified task. If the task has 
completed execution, control returns to the calling task. If the task is 
active, the calling task is placed in a Suspended queue, the identifier 
of the task for which it is waiting is saved, and control passes to the 
library scheduler. 


4.2.3 LOCKON 

LOCKON checks the status of the lock variable. If the lock variable is 
unlocked, the subroutine locks it and returns control. If the lock 
variable is locked, the calling task is placed in a Suspended queue, the 
identifier of the lock for which the task is waiting is saved, and 
control passes to the library scheduler. 


4.2.4 LOCKOFF 

LOCKOFF changes the status of the lock variable to unlocked. LOCKOFF 
then removes the first task waiting for that lock from a Suspended queue 
and puts it in the Waiting for Logical CPU queue. Control passes to the 
library scheduler. 


4.2.5 EVWAIT 

EVWAIT checks the status of the event. If the status is posted, control 
returns to the calling task without further action. If the status is 
cleared, the task is put in a Suspended queue, the identifier of the 
event for which it is waiting is saved, and control passes to the library 
scheduler. 


4.2.6 EVPOST 

EVPOST changes the status of the event variable to posted. EVPOST then 
removes all tasks waiting for that event from a Suspended queue and puts 
them in the Waiting for Logical CPU queue. Control passes to the library 
scheduler. 


4.2.7 EVCLEAR 

EVCLEAR changes the status of the event variable to cleared, and control 
returns to the calling task. 
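
The behavior of EVWAIT, EVPOST, and EVCLEAR can be summarized in a small
state model. The Python sketch below is illustrative only; the class
Event, its method names, and the use of a plain list as the Waiting for
Logical CPU queue are assumptions made for the example.

```python
from collections import deque

class Event:
    """Toy model of an event variable and its Suspended queue."""

    def __init__(self):
        self.posted = False
        self.suspended = deque()

    def evwait(self, task):
        """Return True if the task continues, False if it is suspended."""
        if self.posted:
            return True
        self.suspended.append(task)
        return False

    def evpost(self, ready_queue):
        """Post the event and move every waiting task to ready_queue."""
        self.posted = True
        while self.suspended:
            ready_queue.append(self.suspended.popleft())

    def evclear(self):
        """Clear the event; later waits suspend until the next post."""
        self.posted = False
```

Note that a post releases all waiting tasks at once and the event stays
posted until it is explicitly cleared, so a task that waits after the
post continues without being suspended.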


4.3 STATE TRANSITIONS 

The multitasking routines and library scheduler described above cause 
user tasks to move from state to state over the course of a job. Figure 
4-1 shows these transitions. 


   States of a user task: 

   (1)  Connected to physical CPU 
   (2)  Connected to logical CPU, waiting for physical CPU 
   (3)  Waiting for logical CPU 
   (4)  Nonexistent 
   (5)  Suspended for LOCK 
   (6)  Suspended for EVENT 
   (7)  Suspended on TSKWAIT 

            Figure 4-1. Transitions of user tasks 

The descriptions of transitions between states in figure 4-1 are as 
follows: 

   (1) --> (2)   Because of interrupt, COS removes task from physical 
                 CPU. 

   (2) --> (1)   COS selects task for execution on physical CPU, using 
                 the COS job scheduler (JSH) parameters and algorithms. 

   (2) --> (3)   A multitasking routine is called. (This transition 
                 may be transitory and may be immediately followed by 
                 the opposite transition.) 

   (3) --> (2)   Library scheduler assigns the task to a logical CPU, 
                 using the internal FIFO queue. 

   (3) --> (4)   Task completes execution. 

   (4) --> (3)   Task created through a call to TSKSTART. 

   (3) --> (5)   Task executes LOCKON for a lock that is on already. 

   (5) --> (3)   Some other task executed LOCKOFF, and this task is 
                 selected to receive the lock. 

   (3) --> (6)   Task executes EVWAIT for an event that is not posted. 

   (6) --> (3)   Another task posts the event for which this task is 
                 waiting. 

   (3) --> (7)   Task executes TSKWAIT for an existing task. 

   (7) --> (3)   The task being waited for completes execution. 
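
The transitions listed for figure 4-1 can be written as a lookup table,
which makes it easy to check whether a proposed history of a task is
possible. The Python sketch below is only a restatement of that list;
the names TRANSITIONS and is_legal are hypothetical.

```python
# State numbers follow figure 4-1; the table restates the listed transitions.
TRANSITIONS = {
    (1, 2): "COS removes the task from the physical CPU (interrupt)",
    (2, 1): "COS selects the task for execution on a physical CPU",
    (2, 3): "a multitasking routine is called",
    (3, 2): "library scheduler assigns the task to a logical CPU",
    (3, 4): "task completes execution",
    (4, 3): "task created through a call to TSKSTART",
    (3, 5): "LOCKON for a lock that is already on",
    (5, 3): "another task executed LOCKOFF",
    (3, 6): "EVWAIT for an event that is not posted",
    (6, 3): "another task posts the event",
    (3, 7): "TSKWAIT for an existing task",
    (7, 3): "the task being waited for completes",
}

def is_legal(path):
    """True if every consecutive pair of states is a listed transition."""
    return all(step in TRANSITIONS for step in zip(path, path[1:]))
```

For example, the history 4, 3, 2, 1, 2, 3, 6, 3, 4 (created, run,
suspended on an event, released, completed) is legal, whereas a direct
move from state 5 to state 2 is not, because every suspended task passes
through the Waiting for Logical CPU state first.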


5 MULTITASKING BASICS 


Multitasking is a tool for speeding up the apparent execution time of 
programs. An understanding of how and when the tool should be used and 
what can be gained by using it is important. This understanding is based 
on concepts that may not previously have concerned the programmer but are 
now of fundamental importance. 

Multitasking does not reduce the CPU cycles necessary to execute the 
program. In fact, multitasking introduces an overhead that increases CPU 
time. Hence, the number of calls to the multitasking library must be 
minimized. To reduce overhead, parallelism should be exploited at the 
highest level possible. 

Multitasking reduces the elapsed (turnaround) time of a single program 
executed in a monoprogramming (dedicated) environment. This is the major 
benefit of multitasking, and it happens because all processors are 
available to simultaneously consume the CPU time needed for the program. 
Production work profits from multitasking in this mode. 

In a multiprogramming (batch) environment, the tasks of a multitasking 
job compete with other jobs in the system for the processor resources. 

The speedup of the multitasking job depends on how successful the job is 
in this competition. Job priority may favor multitasking jobs over other 
jobs for improved performance. Experimental and development work profit 
from multitasking in this mode. 


5.1 COMPUTATIONAL AND STORAGE DEPENDENCE 


The ability to execute tasks in parallel requires that each task is 
independent of other tasks. This independence comes in several forms, 
and tasks must be independent in all forms before they can be 
multitasked. This section describes independent iterations of loops. 


5.1.1 COMPUTATIONAL DEPENDENCE 


Computational dependence is composed of data dependence and control 
dependence. These forms of dependence can be analyzed by looking at 
dependence graphs. Dependence graphs are a pictorial representation of 
the relationships between statements in a program. They are useful in 
understanding constraints imposed by a program to preserve orderly 
(deterministic) and correct results. The analysis of dependence graphs 
shows when the opportunity for vectorization and multitasking exists. 

Dependence graphs are composed of nodes representing the statements in a 
program and directed arcs connecting nodes representing the ordering 
constraints. We are concerned here with assignment statements, 
assignments within loops, and I/O statements. The various types of 
dependencies between statements are described in the following 
subsections. 


Data dependence 


Data dependence is an ordering relationship between statements that use 
or produce the same data. The ordering must be preserved to generate 
correct results. Analysis of data dependence determines if the 
statements can be vectorized or parallelized without violating this 
ordering. The types of data dependence are as follows: 

• Flow dependence 

• Anti-dependence 

• Output dependence 

• Unknown dependence 

• I/O dependence 

Flow dependence - Statement s2 is flow-dependent on statement s1 if 
an execution path exists from s1 to s2 and if the same variable 
(scalar or array element) is assigned in statement s1 and used in 
statement s2. A directed arc connects node s1 to a flow-dependent 
node s2 in a dependence graph. 


Example: 

      DO 10 I = 1,N
    1    A(I) = B(I) + 5.
    2    C(I) = A(I) * 3.
   10 CONTINUE

   Dependence graph: s1 --> s2 

In this example, statement s2 is flow-dependent on statement s1. The 
order of these two statements (computations) must be preserved so that 
the value of A(I) is produced before it is used. Vectorization or 
parallelization of this loop does not violate this ordering. 

Remembering that FORTRAN DO-loops are just a shorthand notation for 
repeating similar statements, note that the original dependence graph is 
really a compact form of many similar dependence graphs, one for each 
iteration of the loop. The work tableau in figure 5-1 illustrates this 
concept. 


                          Iterations 

                 I=1    I=2    I=3    ...    I=N 

                 s1     s1     s1     ...    s1 
   Statements     |      |      |             | 
                 s2     s2     s2     ...    s2 

   Figure 5-1. Flow dependence permitting vectorization 
               or multitasking 

Figure 5-1 contains no dependence arcs leading from one iteration to 
another. The loop has independent iterations, a property that enables 
any implementation of the program to execute the iterations in any order 
and still obtain the same correct answers. In particular, this property 
enables a multitasking implementation of the program to execute the 
iterations of the loop simultaneously on multiple processors. 


Example: 

      DO 10 I = 1, N
    1    A(I+1) = B(I) + 5.
    2    B(I+1) = C(I) * 3.
   10 CONTINUE

   Dependence graph: s2 --> s1 

In this example, statement s1 is flow-dependent on statement s2 
across iterations of the loop. The value of B(I+1) must be computed 
before it is used in the next iteration. The loop cannot be vectorized 
or multitasked unless it is optimized by reordering statements. 

Looking at the work tableau for this loop (figure 5-2) shows why the loop 
does not have independent iterations. 


                          Iterations 

                 I=1    I=2    I=3    ...    I=N 

                 s1     s1     s1     ...    s1 
   Statements        /      /      /      / 
                 s2     s2     s2     ...    s2 

   Figure 5-2. Flow dependence prohibiting vectorization 
               or multitasking 

The presence of dependence arcs between iterations prevents the 
independent execution of the iterations of the loop. 
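
The difference between the two loops can be demonstrated by running the
iterations in different orders. The Python sketch below mirrors the two
example loops with 0-based lists; the function names and the choice to
start the undefined elements at 0.0 are assumptions made for the
illustration.

```python
def loop_independent(b, order):
    """A(I) = B(I) + 5; C(I) = A(I) * 3, with iterations run in `order`."""
    n = len(b)
    a = [0.0] * n
    c = [0.0] * n
    for i in order:
        a[i] = b[i] + 5.0
        c[i] = a[i] * 3.0
    return a, c

def loop_dependent(b_first, c, order):
    """A(I+1) = B(I) + 5; B(I+1) = C(I) * 3, with iterations run in `order`."""
    n = len(c)
    a = [0.0] * (n + 1)
    b = [b_first] + [0.0] * n      # only the first element of B is set on entry
    for i in order:
        a[i + 1] = b[i] + 5.0      # uses the B value written by iteration i-1
        b[i + 1] = c[i] * 3.0
    return a, b
```

Running loop_independent forward or in a shuffled order gives identical
arrays. Running loop_dependent backward reads elements of B before the
earlier iteration has written them, so the A array differs from the
forward result.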

Anti-dependence - Statement s2 is anti-dependent on statement s1 if 
an execution path exists from statement s1 to statement s2 and if the 
same variable (scalar or array element) is used by statement s1 and 
assigned by statement s2. A directed arc connects node s1 to node 
s2 in the dependence graph. 


Example: 

      DO 10 I = 1, N
    1    A(I) = B(I)
    2    B(I) = C(I) + 2.
   10 CONTINUE

   Dependence graph: s1 --> s2 

In this example, statement s2 is anti-dependent on statement s1. The 
order of these two statements (computations) must be preserved so that 
the value of B(I) is used before it is redefined. Vectorization or 
parallelization does not violate this ordering (as illustrated earlier in 
figure 5-1). 


Example: 

      DO 10 I = 1, N
    1    A(I) = 2.
    2    B(I) = A(I+1) + 1.
   10 CONTINUE

   Dependence graph: s2 --> s1 

In this example, statement s1 is anti-dependent on statement s2 
across the iterations of the loop. The loop cannot be vectorized or 
multitasked unless the statements are reordered (as illustrated earlier 
in figure 5-2). 

Output dependence - Statement s2 is output dependent on statement s1 
if an execution path exists from statement s1 to statement s2 and if 
the same variable (scalar or array element) is assigned in both 
statements. A directed arc connects statement s1 to statement s2 in 
the dependence graph. 


Example: 

      DO 10 I = 1, N
    1    B(I) = C(I)
    2    B(I) = D(I) * W
   10 CONTINUE

   Dependence graph: s1 --> s2 

In this example, statement s2 is output dependent on statement s1. 
The order of these two statements (computations) must be preserved so 
that after the loop has completed, B(I) has the value assigned in 
statement s2 rather than statement s1. Vectorization or 
parallelization of this loop does not violate this ordering. 

Unknown dependence - Sometimes the dependence relationship between 
statements cannot be determined. This can happen in cases such as the 
following : 

• A subscript is subscripted (indirect addressing). 

• The subscript does not contain the loop index variable. 

• A variable appears more than once with subscripts having different 
  coefficients of the loop variable. 

• The subscript is nonlinear in the loop index variable. 

When one of these cases occurs, the conservative assumption that a 
dependence exists must be followed. Frequently this implies that neither 
vectorization nor multitasking is possible. 

I/O dependence - A READ statement can be considered equivalent to an 
assignment statement in which the variable being read is on the left-hand 
side. Likewise, a WRITE statement can be considered equivalent to an 
assignment statement in which the variable being written is on the 
right-hand side. Thinking of READ and WRITE statements in this way, I/O 
statements can have data dependencies with assignment statements in the 
same way that assignment statements depend on one another. 


Example: 

      DO 10 I = 1, N
    1    READ *, A(I)
    2    Q(I) = A(I)
    3    A(I) = C(I)
    4    WRITE *, A(I)
   10 CONTINUE

   Dependence graph: s1 --> s2 --> s3 --> s4 


The following dependencies are present in this example: 

• s2 is flow dependent on s1. 

• s3 is anti-dependent on s2. 

• s4 is flow dependent on s3. 

• s3 is output dependent on s1. 

• s4 is flow dependent on s1. 

Additionally, I/O statements can exhibit a different kind of dependence 
(I/O dependence) on other I/O statements. This dependence occurs not 
because the same variable is involved, but because the same file is 
involved. 

I/O statement s2 is I/O dependent on I/O statement s1 if an execution 
path exists from s1 to s2 and if the file referenced in both I/O 
statements is the same. A directed arc connects node s1 to s2 in the 
dependence graph. In the example above, additional arcs would lead from 
s1 to s1 and from s4 to s4, since different iterations access the 
same files. 


Example: 

      DO 10 I = 1, N
    1    READ (3), A(I)
         REWIND (3)
    2    WRITE (3), B(I)
         REWIND (3)
   10 CONTINUE

   Dependence graph: s1 <--> s2 

In this example, statements s1 and s2 are I/O dependent on each other 
because they both access file 3. We do not consider vectorizing I/O. 
This example cannot be modified to be executed in parallel. 

Attempting to parallelize I/O at the loop level may require a significant 
change to the mechanisms for I/O to preserve the orderliness of the file 
and correct results. For example, sequential write can be changed to 
random access, either explicitly through the I/O call or by storing the 
records into reserved slots in a memory buffer, which is then written 
when all iterations have completed. 
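
The reserved-slot approach can be sketched as follows. This Python model
is only an illustration of the idea: the function name write_in_slots,
the record format, and the in-memory file object are assumptions, and
the iterations are run one after another in a chosen order rather than
by real tasks.

```python
import io

def write_in_slots(records, order, out):
    """Store each iteration's record in its own slot, then write once."""
    slots = [None] * len(records)      # one reserved slot per iteration
    for i in order:                    # iterations may complete in any order
        slots[i] = f"record {i}: {records[i]}\n"
    for line in slots:                 # single sequential write at the end
        out.write(line)

# Two different iteration orders produce the same file contents.
first, second = io.StringIO(), io.StringIO()
write_in_slots([10, 20, 30], [0, 1, 2], first)
write_in_slots([10, 20, 30], [2, 0, 1], second)
same = first.getvalue() == second.getvalue()
```

Because each iteration touches only its own slot, the order of the file
is fixed by the slot index rather than by the order in which the
iterations happen to finish.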


Examples: 

The following loops have data-independent iterations (no dependence arcs 
between iterations in the work tableau): 

      DO 10 I = 1, N
         A(I) = B(I) + C(I)
         D(I) = A(I) + E(I)
         B(I) = F(I)
         B(I) = B(I) + G(I)
   10 CONTINUE

      DO 30 J = 1, N
         DO 20 I = 1, N
            A(I,J) = B(I,J) + C(I,J)
   20    CONTINUE
   30 CONTINUE

The 30 loop and the 20 loop in the above example have data-independent 
iterations. 

      DO 40 I = 1, N
         A(I) = SSUM(N,B(1,I),1)
   40 CONTINUE

      DO 60 J = 1, N
         S(J) = FLOAT(J)
         DO 50 I = 1, J
            A(I,J) = FUNCT(S(J),I)
   50    CONTINUE
   60 CONTINUE

      FUNCTION FUNCT (S, I)
      FUNCT = S + FLOAT(I)
      RETURN
      END

The data independence of the iterations of either loop in the above 
example cannot be determined without analyzing the FUNCT subprogram. 
Here the iterations of both loops are independent because the function 
does not redefine its input (no side effects). 

The following loops have data-dependent iterations (dependence arcs 
present between iterations in work tableau): 

      S = 0.0
      DO 10 I = 1, N
         S = S + A(I)
   10 CONTINUE

      DO 20 I = 1, N
         X(I) = A(I)*X(I-1) + B(I)
   20 CONTINUE

      DO 40 J = 1, N
         A(J) = 0.0
         DO 30 I = 1, N
            A(J) = A(J)*C(I) + B(I,J)
   30    CONTINUE
   40 CONTINUE

While the 30 loop has data-dependent iterations, the 40 loop has 
data-independent iterations. 

      DO 60 J = 1, N
         S(J) = FLOAT(J)
         DO 50 I = 1, J
            A(I,J) = FUNCT(S(J),I)
   50    CONTINUE
   60 CONTINUE

      FUNCTION FUNCT (S, I)
      S = SQRT(S)
      FUNCT = S + FLOAT(I)
      RETURN
      END

The 60 loop has data-independent iterations, but the 50 loop does not. 


Control dependence 

Control dependence refers to the situation where the order of execution 
of statements cannot be determined before run time. Such is the case 
when conditional statements (for example, IF) appear in a program. 
Conditional statements conditionally introduce or conditionally eliminate 
data dependence among statements. 

Because computational independence must be guaranteed when attempting 
multitasking, the conservative approach is necessary here. The 
programmer must assume that any conditional statement may or may not 
cause the execution of its object statement. One must analyze all 
possible execution paths through the code to be multitasked. All data 
dependencies are maintained, and none are eliminated. 

If an IF statement tests a value computed earlier in the same iteration, 
then the testing does not introduce a data dependence among iterations. 

If an IF statement tests a value computed in another iteration, then the 
iterations are dependent and cannot be done in parallel. 

If the object statement of an IF statement assigns a variable used later 
in the same iteration, then the assignment does not introduce data 
dependence among iterations. If the object statement of an IF statement 
assigns a variable used in another iteration, then the iterations are 
dependent and cannot be executed in parallel. 


Examples: 

The following loops have control-independent iterations: 

      DO 10 I = 1, N
         A(I) = B(I)
         IF ( A(I).LT.0.0 ) A(I) = 0.0
   10 CONTINUE

      DO 20 I = 1, N
         IF ( I.LT.10 ) GO TO 15
         IF ( I.GT.20 ) GO TO 16
   15    A(I) = B(I)
         GO TO 20
   16    A(I) = C(I)
   20 CONTINUE

      DO 30 J = 1, N
         IF ( A(J).EQ.B(J) ) GO TO 26
         DO 25 I = 1, J
            IF ( C(I).EQ.A(J) ) GO TO 25
            D(I,J) = A(J)
   25    CONTINUE
   26    CONTINUE
   30 CONTINUE


The following loops have control-dependent iterations: 

      DO 40 I = 1, N
         IF ( A(I-1).EQ.0.0 ) A(I) = 1.0
   40 CONTINUE

      K = J
      DO 50 I = 1, N
         A(I) = 1.0
         IF ( B(I).EQ.0.0 ) A(K) = 0.0
         K = K + 1
   50 CONTINUE

      DO 70 J = 1, N
         DO 60 I = 1, N
            S(J) = AMAX(S(J),X(I))
   60    CONTINUE
   70 CONTINUE

In the last example, while the 60 loop has control-dependent iterations, 
the 70 loop has control-independent iterations. 


5.1.2 STORAGE DEPENDENCE 

While computational dependence is concerned with the independence of the 
work to be done, storage dependence is concerned with the independence 
of workspace. Each parallel computational task has access to variables, 
and the fetching and storing of all variables in one task must not 
interfere with that in another task. Each task must work on independent 
storage locations or use special access mechanisms (for example, locks) 
to guarantee the safe modification of shared variables. 

Storage dependence is a data dependence between the iterations of a loop 
caused by the extent of a left hand side variable being less than the 
index range of the loop considered for multitasking. The same storage 
location is reused and redefined in each iteration. This causes a cycle 
in the dependence graph for the loop and introduces dependence arcs 
between iterations in the work tableau. 

Data dependence between iterations can frequently be eliminated by a 
program modification that provides each iteration (or the partition of 
iterations belonging to each task) with a separate storage location 
corresponding to the original variable. 
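
The modification just described can be sketched as follows. Python is used here only as illustrative notation; it is not part of the multitasking library, and the function names and the two-way split are assumptions made for the sketch. Each task accumulates into its own storage location, and the partial results are combined only after both tasks have finished.

```python
import threading

def partial_sum(a, lo, hi, partial, slot):
    # Each task writes only to its own slot, so no storage location is reused
    # by more than one task.
    s = 0.0
    for i in range(lo, hi):
        s += a[i]
    partial[slot] = s

def two_task_sum(a):
    n = len(a)
    partial = [0.0, 0.0]                      # one separate location per task
    t = threading.Thread(target=partial_sum, args=(a, n // 2, n, partial, 1))
    t.start()                                 # second task sums the upper half
    partial_sum(a, 0, n // 2, partial, 0)     # this task sums the lower half
    t.join()                                  # wait for the second task
    return partial[0] + partial[1]            # combine after both are done
```

The single variable of the sequential loop becomes one element of `partial` per task, which is the kind of separate storage the paragraph above calls for.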

Storage dependence is easily overlooked and often difficult to identify 
as a multitasked program bug. 


Examples: 

The following loops have storage-independent iterations: 

      DO 10 I = 1, N
      A(I) = B(I) + C(I)
   10 CONTINUE

      DO 20 I = 1, N
      IF ( A(I).EQ.0.0 ) GO TO 20
      B(I) = C(I)/A(I)
   20 CONTINUE

      DO 30 J = 1, N
      DO 29 I = 1, N
      A(I,J) = 0.0
      DO 28 K = 1, N
      A(I,J) = A(I,J) + B(I,K)*C(K,J)
   28 CONTINUE
   29 CONTINUE
   30 CONTINUE

In the last example, the 30 and 29 loops have storage-independent 
iterations, but the 28 loop does not. The variable A(I,J) has an extent 
one over the range of K. 

The following loops have storage-dependent iterations: 

      S = 0.0
      DO 40 I = 1, N
      S = S + A(I)
   40 CONTINUE

      DO 70 K = 1, N
      DO 50 J = 1, N
      A(J) = B(J,K) + C(J,K)
   50 CONTINUE
      DO 60 I = 1, N
      D(I,K) = A(I) + 1.0
   60 CONTINUE
   70 CONTINUE



In this example, the 50 and 60 loops have storage-independent iterations, 
but the 70 loop does not. The variable A(J) has an extent one over the 
range of K. 


5.1.3 GENERALIZATIONS 

Applying the concepts of statement level dependence at a higher level, 
say at the code segment, subroutine, process, or task level, is 
frequently useful. The dependence of two higher level objects can be 
inferred from the dependence of the statements in one to the statements 
in the other. 

The goal of analyzing the dependencies in a code is to identify the 
variables and program constructs that need modification before 
multitasking can take place. Section 6 contains more discussion on 
analysis, and section 7 describes conversion techniques. 


5.2 SCOPE 

The scope of a variable is the region of the program in which the 
variable is defined and can be referenced. The traditional regions in 
which a variable has scope include statement, program unit, and 
executable program. Outside of a variable's scope, the variable is not 
defined and references to the variable's name refer to a different 
variable and a distinct memory location. 

The scope boundaries that delineate a variable's scope are easy to 
recognize. They may be the beginning and end of a statement, the first 
and last line in a program unit, or the first and last line of a program. 

The multitasking conversion process introduces a new type of scope 
region, called a block. Unlike the traditional scope regions, which have 
fixed definitions of boundaries, the block scope region is flexible to 
some extent. The portion of code making up a block is related to the new 
scope boundaries introduced by multitasking. 

Since the user has the option of applying multitasking at different 
levels of parallelism, the boundaries defined by a block need to be 
flexible enough to handle all cases. The minimum block scope for a 
variable includes the segment of code starting with the statement which 
first defines (assigns) a variable and ending with the statement that 
last refers to the variable (assuming straight line code) . For more 
complicated execution flow, these notions are easily generalized. A 
block scope region may be a single statement, a DO loop, a program 
segment within a program unit, a program unit, portions of several 
program units, or the whole executable program. 

When a program is partitioned into pieces to be performed by several 
CPUs, a set of independent iterations of a loop is taken to form a task 
subroutine. The effect of this partitioning is to introduce subroutine 
scope boundaries in the program where the DO and CONTINUE statements of 
the loop occur. 

Compare the introduced subroutine scope boundaries with the minimum block 
scope of each variable that appears in the original unpartitioned DO 
loop. If the minimum block scope of a variable is contained within the 
new subroutine scope, the variable is local to the loop, and separate 
storage locations for these local variables must be provided in each 
partition of the iterations of the loop. 

Further analysis and program modifications are required for variables 
whose minimum block scope extends beyond the new subroutine scope 
boundaries. Sections 6 and 7 offer more information on this. 


Examples : 

Consider the following program unit: 

 1        SUBROUTINE SUB
 2
 3
 4        DO 20 J = 1, N
 5        X(J) = Y(J) + Z(J)
 6        DO 10 I = 1, N
 7        A(I,J) = B(I,J) + X(J)
 8     10 CONTINUE
 9     20 CONTINUE
10
11        END

The 20 loop has independent iterations; multitasking it introduces 
subroutine scope boundaries at lines 4 and 9. The left side variables 
are X and A. The traditional scope of both A and X is the program unit 
(lines 1-11). The minimum block scope of A is line 7, and that of X is 
lines 5-7, assuming that neither A nor X appears outside the 20 loop. 

The right side variables Y, Z, and B were assigned values before the 20 
loop if the program is correct. Clearly, their minimum block scope must 
extend beyond the new subroutine boundaries about to be introduced. 
Modifications must be made for multitasking so that their new scope 
includes the task subroutine. Alternatives are putting them in COMMON or 
passing them as arguments to the subroutine task. 

If the 20 loop is converted for multitasking, the new code may look as 
follows : 


 1        SUBROUTINE SUB
 2        ... declarations, initializations
 3        CALL TSKSTART ( IDTASK, T, N, Y, Z, B )
 4        DO 20 J = 1, N/2
 5        X(J) = Y(J) + Z(J)
 6        DO 10 I = 1, N
 7        A(I,J) = B(I,J) + X(J)
 8     10 CONTINUE
 9     20 CONTINUE
10        CALL TSKWAIT ( IDTASK )
11        END
12        SUBROUTINE T ( N, Y, Z, B )
13        ... declarations
14        DO 20 J = N/2+1, N
15        X(J) = Y(J) + Z(J)
16        DO 10 I = 1, N
17        A(I,J) = B(I,J) + X(J)
18     10 CONTINUE
19     20 CONTINUE
20        ...
21        END


The elements of A and X computed by task subroutine T are not available 
in subroutine SUB. (The assumption was made that they were not needed.) 
Variables A and X in T are different from variables A and X in 
subroutine SUB. 

If the variable A in the original program were referenced after the 20 
loop, then A would also have to be passed as an argument to task T. 

Thus, what goes on outside the loop we are considering for multitasking 
has an effect on the required conversion. 


5.3 DETERMINISM 

The order in which statements are executed in a program run on only one 
CPU is well defined. Repetitive runs produce identical results because 
the same instructions are executed in the same order each time. Data and 
control dependencies are satisfied simply by the location of statements 
in the program. Temporal ordering is implied by spatial ordering. 

Multitasking introduces a new dimension to the order of execution. While 
the sequence of execution in each task remains well defined, the relative 
order of task execution has no default order. In fact, the order of 
execution may change from run to run, and the programmer must control the 
ordering to satisfy dependencies. Failure to manage the temporal 
ordering of tasks is a subtle error that may be difficult to identify. 

Consider the following program segment: 

      DO 10 I = 1, N
      A(I) = AINIT(I)
   10 CONTINUE

      DO 30 J = 1, N
      DO 20 I = 1, N
      B(I,J) = A(I)*C(J)
   20 CONTINUE
   30 CONTINUE

This segment is incorrectly converted to the following: 

      CALL TSKSTART ( IDTASK, T, N, A, AINIT, B, C )
      DO 10 I = 1, N/2
      A(I) = AINIT(I)
   10 CONTINUE
      DO 30 J = 1, N/2
      DO 20 I = 1, N
      B(I,J) = A(I)*C(J)
   20 CONTINUE
   30 CONTINUE
      CALL TSKWAIT ( IDTASK )

      SUBROUTINE T ( N, A, AINIT, B, C )
      DO 10 I = N/2+1, N
      A(I) = AINIT(I)
   10 CONTINUE
      DO 30 J = N/2+1, N
      DO 20 I = 1, N
      B(I,J) = A(I)*C(J)
   20 CONTINUE
   30 CONTINUE
      RETURN
      END

The resulting code may not produce the right answers, because both 10 
loops may not finish before either 20 loop begins. Synchronization is 
required between the 10 and 30 loops in both tasks to ensure that the 
data dependencies on A are satisfied. A correct program may look as 
follows: 

      COMMON / EVENTS / IDONE1, IDONE2
      CALL TSKSTART ( IDTASK, T, N, A, AINIT, B, C )
      DO 10 I = 1, N/2
      A(I) = AINIT(I)
   10 CONTINUE
      CALL EVPOST ( IDONE1 )
      CALL EVWAIT ( IDONE2 )
      CALL EVCLEAR ( IDONE2 )
      DO 30 J = 1, N/2
      DO 20 I = 1, N
      B(I,J) = A(I)*C(J)
   20 CONTINUE
   30 CONTINUE
      CALL TSKWAIT ( IDTASK )

      SUBROUTINE T ( N, A, AINIT, B, C )
      COMMON / EVENTS / IDONE1, IDONE2
      DO 10 I = N/2+1, N
      A(I) = AINIT(I)
   10 CONTINUE
      CALL EVPOST ( IDONE2 )
      CALL EVWAIT ( IDONE1 )
      CALL EVCLEAR ( IDONE1 )
      DO 30 J = N/2+1, N
      DO 20 I = 1, N
      B(I,J) = A(I)*C(J)
   20 CONTINUE
   30 CONTINUE
      RETURN
      END
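
The same ordering requirement can be sketched in Python, which is used here only as illustrative notation and is not part of the multitasking library. The sketch replaces the pair of events with a single two-party barrier (`threading.Barrier`), which is a different mechanism from EVPOST and EVWAIT but enforces the same rule: neither task starts its share of B until both shares of A are complete. The function and variable names are assumptions.

```python
import threading

def run_two_phase(ainit, c):
    n = len(ainit)
    a = [0.0] * n
    b = [[0.0] * n for _ in range(n)]      # b[i][j] corresponds to B(I,J)
    barrier = threading.Barrier(2)          # both tasks must arrive to proceed

    def task(lo, hi):
        for i in range(lo, hi):             # phase 1: this task's part of A
            a[i] = ainit[i]
        barrier.wait()                      # all of A is defined past this point
        for j in range(lo, hi):             # phase 2: this task's columns of B
            for i in range(n):
                b[i][j] = a[i] * c[j]

    t = threading.Thread(target=task, args=(n // 2, n))
    t.start()                               # second task takes the upper half
    task(0, n // 2)                         # this task takes the lower half
    t.join()
    return b
```

Removing the `barrier.wait()` call reproduces the incorrect conversion: a task could read elements of `a` that the other task has not yet assigned, and the result could change from run to run.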


5.4 SPEEDUP FROM MULTITASKING 

Multitasking produces the best speedup when applied to balanced tasks of 
sufficient size. Speedup occurs only when multiple processors have 
something to do, and only when the time saved in executing independent 
tasks in parallel outweighs the overhead penalty. 


SN-0222 





5.4.1 TASK GRANULARITY 

Multitasking does not come free. The initiation, management, and 
interaction of tasks are accomplished by code that is not in the original 
program. This code adds to the execution time of the program and limits 
the granularity of parallelism that can be profitably exploited. The 
costs of multitasking and the size of tasks that can benefit from 
parallel execution must therefore be appreciated. 

When converting a program for multitasking, one should always look at the 
size of the task to see if multitasking will produce a speedup. At one 
end of the spectrum, we may have the following program, in which 
subroutines A and B do independent operations on independent data: 

Original code: 

      PROGRAM MAIN
      CALL A
      CALL B
      STOP
      END

Multitasking code: 

      CPU-0                                  CPU-1

      PROGRAM MAIN
      CALL TSKSTART(TID,A)      ---->        SUBROUTINE A
      CALL B
      CALL TSKWAIT(TID)         <----        RETURN
      STOP
      END

This program can benefit from multitasking, depending on the size 
(execution time) of subroutines A and B, the difference in size between A 
and B, and the multitasking overhead. Table 5-1 looks at the other end 
of the spectrum. The task time is specified in both seconds and clock 
periods (CPs) in table 5-1. 

Parallelism exists in each of the tasks in table 5-1, with the 
granularity of parallelism increasing toward the top. The size of these 
examples is a function of N. For certain values of N, some of these 
examples may profit from multitasking, while others are too small to 
consider. We will investigate the matrix addition example. 


Table 5-1. Sample tasks containing parallelism 

        Task time         Number of
   Seconds     CPs        operations   Representative task
   --------    -----      ----------   ------------------------------------

   10**(-2)    10**6      N**4               DO 4 I = 1, N
                                           4 CALL MATMUL(N,A,B,C)

   10**(-3)    10**5      N**3               DO 3 I = 1, N
                                             DO 3 J = 1, N
                                             A(I,J) = 0.
                                             DO 3 K = 1, N
                                           3 A(I,J) = A(I,J) + B(I,K)*C(K,J)

   10**(-4)    10**4      N**2               DO 2 J = 1, N
                                             DO 2 I = 1, N
                                           2 A(I,J) = B(I,J) + C(I,J)

   10**(-5)    10**3      N                  DO 1 I = 1, N
                                           1 V(I) = U(I) + W(I)

   10**(-6)    10**2      1                  S = X1 + X2 + X3 + X4


Original code: 

      DO 1 J = 1, N
      DO 1 I = 1, N
      A(I,J) = B(I,J) + C(I,J)
    1 CONTINUE



Multitasking code (static load balancing): 

          CPU-0                                     CPU-1

          COMMON /GLOBAL/ A,B,C
  A -->   L = N/2
          LP1 = L + 1
  B -->   CALL TSKSTART(IDT,T,LP1,N)   ---->
  C -->                                    <-- F
                                           <-- G    SUBROUTINE T(LP1,N)
                                                    COMMON /GLOBAL/ A,B,C
          DO 1 J = 1, L                             DO 1 J = LP1, N
          DO 1 I = 1, N                             DO 1 I = 1, N
          A(I,J) = B(I,J) + C(I,J)                  A(I,J) = B(I,J) + C(I,J)
        1 CONTINUE                                1 CONTINUE
  D -->   CALL TSKWAIT(IDT)            <----
                                           <-- H    RETURN
  E -->                                             END


Significant events in the execution of the multitasked code are as 
follows: 

A: The experiment begins. CPU-0 executes setup code. 

B: CPU-0 calls TSKSTART. 

C: CPU-0 resumes its own processing. 

D: CPU-0's half of work complete; CPU-0 calls TSKWAIT. 

E: The experiment ends, synchronization takes place, and CPU-0 
continues. 

F: CPU-1 becomes aware of new task. 

G: CPU-1 begins processing its half of the work. 

H: CPU-1 completes the work; the task dies. 

Figure 5-3 illustrates these events on a time line. 


[Time line: CPU-0 passes through events A, B, C, D, and E in that order; 
CPU-1 passes through events F, G, and H, with F coinciding with C.] 


Figure 5-3. Time line for a 2-CPU multitasking example 


The following assumptions and observations can be made on the basis of 
this time line: 

• The execution time for one CPU would be 2 * (D-C) = X. 

• The execution time for two CPUs is E - A. 

• The time at event C is equal to the time at event F. 

• The multitasking overhead is (C-A) + (E-D) for CPU-0 and (G-F) for 
  CPU-1. 

• The speedup factor for multitasking is as follows: 

                  Time(1-CPU)            2 * (D-C)
       Speedup = ------------- = -----------------------
                  Time(2-CPU)     (E-D) + (D-C) + (C-A)

• The following are estimates for the execution times of various 
  segments of the program: 

       (B-A) = 50 CP
       (C-B) = 4000 CP
       (D-C) = Independent variable X/2 = 1/2 of 1-CPU execution time
       (G-F) = 100 CP
       (H-D) = (G-F)
       (H-G) = (D-C)
       (E-H) = 1000 CP
       C = F

• For this multitasking model, the overhead is simply the sum of the 
  individual overheads for CPU-0, since the overhead to CPU-1 
  eventually causes CPU-0 to wait. The expected speedup is a 
  function of the size of the original task, X, measured in CPs: 

                            X
       Sp = Speedup = ------------
                       X/2 + 5150


Figure 5-4 plots the speedup function for matrix addition to show when 
the multitasked code will be slower than, the same speed as, and faster 
than the original code. 
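
The speedup function can also be evaluated directly. The following is a minimal sketch in Python, used here only as illustrative notation; the function name is an assumption, and the 5150 CP overhead is the sum of the estimates listed above, (C-A) + (E-D) = 4050 + 1100.

```python
OVERHEAD_CP = 5150   # (C-A) + (E-D) in clock periods, from the estimates above

def speedup(x_cp, overhead_cp=OVERHEAD_CP):
    """Two-CPU speedup for an original task of x_cp clock periods."""
    return x_cp / (x_cp / 2 + overhead_cp)

# Tasks smaller than twice the overhead run slower when multitasked.
break_even_cp = 2 * OVERHEAD_CP
```

With these estimates the break-even point is 10300 CP. The speedup approaches, but never reaches, 2 as the task grows.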

The original task execution time is related to the order of the matrix 
sum, N, and whether the computation is in scalar or vector mode. The 
important measurement of task size should be made in terms of execution 
time and not floating-point operations. The number of operations that 
can be executed in a given amount of time depends greatly on the degree 
of vectorization. 

The following calculations determine the task granularity for matrix 
addition: 

     Speedup = 1.0  <-->  Task size = 10300 CP = O(10**(-4) SEC)
                          N = 50 (vector), N = 15 (scalar)

     Speedup = 1.8  <-->  Task size = 92700 CP = O(10**(-3) SEC)
                          N = 200 (vector), N = 50 (scalar)


Figure 5-4. Original task execution time 


For simple multitasking models, the following formula gives the size of 
the original task required to obtain a desired speedup. The formula does 
not address portions of the program that are not multitasked and should 
not be applied to the program as a whole. 

                Sp * ncpu * overhead
          X  =  --------------------
                    ( ncpu - Sp )

To gain a speedup of Sp on ncpu processors, with a multitasking 
overhead of overhead CPs, the original task must be at least X CPs in 
size. 
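
As a check on the formula, the following sketch (Python, used only as illustrative notation; the function and parameter names are assumptions) computes the required task size and reproduces the two task sizes quoted earlier for the matrix addition example, using the 5150 CP overhead estimated for it.

```python
def required_task_size(sp, ncpu, overhead_cp):
    """Smallest original task, in clock periods, that yields speedup sp."""
    if not 0 < sp < ncpu:
        # A speedup equal to or above the processor count is unreachable
        # in this model, because the denominator would be zero or negative.
        raise ValueError("speedup must satisfy 0 < sp < ncpu")
    return sp * ncpu * overhead_cp / (ncpu - sp)

size_for_break_even = required_task_size(1.0, 2, 5150)   # about 10300 CP
size_for_1_8 = required_task_size(1.8, 2, 5150)          # about 92700 CP
```

The required size grows without bound as the desired speedup approaches the number of processors, which is why very small tasks are not worth multitasking.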


5.4.2 LOAD BALANCING 

To make the best use of processor resources and gain the most speedup, 
work must be partitioned into equal parts to run in parallel. A 
comparison of figures 5-5 and 5-6 shows why this is important. 

Techniques for load balancing are found in section 6. 


[Figure: with an unbalanced split, one CPU performs 1/3 of the work and 
then waits unproductively while the other CPU performs 2/3 of the work. 
The job runs as multiprocessing only until the smaller part is finished 
and as uniprocessing thereafter.] 

Figure 5-5. An unbalanced multitasking job 


[Figure: CPU-0 and CPU-1 each perform 1/2 of the work, so the job runs 
as multiprocessing throughout.] 

Figure 5-6. A balanced multitasking job 


5.5 PREDICTING PERFORMANCE 


Predicting the performance of multitasked code allows the benefits of 
multitasking to be seen before conversion begins. 


5.5.1 FACTORS AFFECTING PERFORMANCE 


Several factors influence the performance of multitasked code as compared 
to the original program. Some of these factors are changes required for 
multitasking to take place, including reentrant code generation by CFT to 
produce stack-based references for local variables. 

All tasks contend for access to shared Central Memory for their code and 
variables. 

While the above factors are largely out of the user’s control, the user 
has influence over the following factors, which frequently have a greater 
impact on performance: 

• Level (granularity) of parallelism exploited 

• Frequency of calls to the multitasking library 

• Partition of work and its distribution among processors 

• Programming style in the choice of multitasking mechanisms 

Of all the factors listed, the most important are task granularity and 
balanced workload distribution. 


5.5.2 MANUAL PERFORMANCE PREDICTION 

In this section a simple program is analyzed to predict the performance 
if it is converted for multitasking. The original code has the following 
structure : 


      PROGRAM MAIN
      DO 100 I = 1, 50
      ...
      DO 10 J = 1, 2
      CALL SUB(J)
   10 CONTINUE
      ...
  100 CONTINUE
      STOP
      END

Analysis of the program indicates that the 100 loop has dependent 
iterations and cannot be executed in parallel. The 10 loop has 
independent iterations and multitasking can be attempted at this level. 
Execution on one CPU shows that 96 percent of the execution time is spent 
in subroutine SUB with an average time of 0.2 seconds per call. The 
total run time is 20.83 seconds. 

First, the theoretical speedup is computed. This computation assumes no 
multitasking overhead or other performance degradation. 

     Execution time(1-CPU) = Time(Seq) + Time(Mt)
                           = (4 percent + 96 percent) * 20.83 seconds
                           = 100 percent * 20.83 seconds
                           = 20.83 seconds

     Execution time(2-CPU) = Time(Seq) + (1/2) * Time(Mt)
                           = (4 percent + 48 percent) * 20.83 seconds
                           = 52 percent * 20.83 seconds
                           = 10.83 seconds

The maximum attainable speedup is calculated as follows: 

                Time(1-CPU)     20.83
     Speedup = ------------- = ------- = 1.92
                Time(2-CPU)     10.83



This is a measure of the parallelism exploited in the program. 
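
The theoretical computation generalizes to any sequential fraction and processor count. The following sketch is in Python, used only as illustrative notation; the function and parameter names are assumptions, and, like the computation above, it ignores multitasking overhead.

```python
def theoretical_speedup(total_time, parallel_fraction, ncpu):
    """Overhead-free speedup when parallel_fraction of the run is multitasked."""
    seq = (1.0 - parallel_fraction) * total_time    # Time(Seq)
    par = parallel_fraction * total_time            # Time(Mt)
    return total_time / (seq + par / ncpu)

example = theoretical_speedup(20.83, 0.96, 2)       # the example above
```

For the example, 96 percent of a 20.83-second run on two CPUs gives a value that rounds to the 1.92 computed above. The result depends only on the fraction and the processor count, not on the total time.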

Next, the program is conceptually converted to the following multitasking 
structure (details of scope and storage independence are not considered) : 


      PROGRAM MAIN
      COMMON /MT/ ISTART, IDONE, JOB
      CALL TSKSTART ( IDTASK, T )
      JOB = 1
      DO 100 I = 1, 50
      CALL EVPOST ( ISTART )
      CALL SUB(1)
      CALL EVWAIT ( IDONE )
      CALL EVCLEAR ( IDONE )
  100 CONTINUE
      JOB = 2
      CALL EVPOST ( ISTART )
      CALL TSKWAIT ( IDTASK )
      STOP
      END

      SUBROUTINE T
      COMMON /MT/ ISTART, IDONE, JOB
    1 CALL EVWAIT ( ISTART )
      CALL EVCLEAR ( ISTART )
      IF ( JOB.NE.1 ) GO TO 2
      CALL SUB(2)
      CALL EVPOST ( IDONE )
      GO TO 1
    2 RETURN
      END


Now, a realistic 2-CPU execution time can be projected from the intrinsic 
costs of the calls to the multitasking library and estimates of other 
relevant quantities. 

     Execution time(2-CPU) = Time(Seq) + (1/2) * Time(Mt) + Overhead

where 

     Overhead = Time(TSKSTART) +
                Time(TSKWAIT) +
                51 * Time(EVPOST) +
                50 * Time(EVCLEAR) +
                50 * Time(EVWAIT) +
                Workload imbalance delay +
                Memory contention delay

The overhead is computed in terms of the main task. Overheads in task T 
are either masked by delays in MAIN or cause delays in T that are 
accounted for in the workload imbalance or memory contention delays in 
MAIN. The execution times for the calls to the multitasking library are 
available in section 3. 

Delays caused by workload imbalance and memory contention must be 
approximated. In this example, we assume no workload imbalance and 
estimate the memory contention at 2 percent. This percentage is applied 
to all the time spent in SUB when called from MAIN. This is the time 
when both tasks are accessing Central Memory. With these estimates, the 
overhead can be computed. 

     Overhead = 1500000 CP +               (TSKSTART)
                1500 CP +                  (TSKWAIT)
                51 * ( 1500 CP ) +         (EVPOST)
                50 * ( 200 CP ) +          (EVCLEAR)
                50 * ( 1500 CP ) +         (EVWAIT)
                0 +                        (workload imbalance)
                (0.02) * 50 * ( 0.2 SEC )  (memory contention)

This calculation gives an overhead of 0.216 seconds, which can be used in 
projecting a realistic 2-CPU execution time. 

     Execution time(2-CPU) = Time(Seq) + (1/2) * Time(Mt) + Overhead
                           = ( 0.83 + 10.0 + 0.216 ) seconds
                           = 11.046 seconds

This estimate of execution time can be used to compute a realistic 
speedup projection. 

                Time(1-CPU)      20.83
     Speedup = ------------- = -------- = 1.88
                Time(2-CPU)     11.046
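
The arithmetic behind this projection can be retraced with a short sketch in Python, used here only as illustrative notation. The clock period of 9.5 nanoseconds is an assumption that is not stated in this section; it is chosen because it converts the library costs to a value consistent with the 0.216 seconds quoted above. The variable names are also assumptions.

```python
CLOCK_PERIOD_SEC = 9.5e-9     # assumed clock period, not given in this section

library_cp = (1500000         # TSKSTART
              + 1500          # TSKWAIT
              + 51 * 1500     # EVPOST, 51 calls
              + 50 * 200      # EVCLEAR, 50 calls
              + 50 * 1500)    # EVWAIT, 50 calls

imbalance_sec = 0.0                      # no workload imbalance assumed
contention_sec = 0.02 * 50 * 0.2         # 2 percent of 50 calls of 0.2 seconds

overhead_sec = library_cp * CLOCK_PERIOD_SEC + imbalance_sec + contention_sec
time_2cpu = 0.83 + 10.0 + overhead_sec   # Time(Seq) + (1/2)*Time(Mt) + Overhead
projected_speedup = 20.83 / time_2cpu
```

Under these assumptions the library calls account for well under a tenth of the overhead; the estimated memory contention dominates, so the projection is most sensitive to that estimate.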

The projected speedup can be used to determine whether multitasking is 
worthwhile. The gain can be weighed against the conversion effort 
required. 

The actual speedup recorded for this example was 1.86. 


5.6 CHOOSING VECTORIZATION OVER MULTITASKING 

Multitasking offers a speedup, Sp < ncpu, over conventional 
processing. It can enhance either scalar or vector performance. Because 
vector processing offers a greater speedup potential over scalar 
processing than multitasking, multitasking should not be employed at the 
expense of vectorization. In the case of a short vector length, scalar 
processing outperforms vector processing. Similarly, in the case of a 
small task size, vector processing (or even scalar processing) may 
outperform multitasking. 

Consider the following simple loop: 

      DO 10 I = 1, N
      A(I) = B(I) + S*C(I)
   10 CONTINUE

Depending on the value of N, the greatest performance may come from 
executing this loop in scalar mode, vector mode, or multitasked vector 
mode. If N is very small, scalar processing may be best. When N is 
large, multitasking is appropriate. The speedup depends on the overhead 
of the particular multitasking mechanism used. 


HIGH LEVEL CONCEPTS AND THEIR IMPLEMENTATION 




This section describes high level parallelism concepts and gives 
suggested implementations using the routines of the multitasking library. 


6.1 PARALLELISM 


Multitasking exploits parallelism in programs. Parallelism occurs when 
certain independence and order requirements are satisfied. The degree of 
parallelism is based on the adherence of program constructs to these 
requirements. Three levels of parallelism can be distinguished: 


     Level of parallelism     Characteristics

     Full (concurrent)        Complete independence and order independent

     Partial (exclusive)      Special dependence relationship and order
                              independent

     None (sequential)        Dependence and order dependent


Concurrent parallelism and exclusive parallelism are candidates for 
multitasking. No multitasking speedup is possible without parallelism. 

The initiation and completion of concurrent parallelism is accomplished 
with the TSKSTART and TSKWAIT library routines. The birth and death of a 
task can be used as synchronization points. 

The management of concurrent parallelism uses the EVPOST and EVWAIT 
library routines. These mechanisms provide for the synchronization of 
work between tasks and can be used in communicating data among tasks. 

Exclusive parallelism is managed with the LOCKON and LOCKOFF library 
routines. These routines monitor special program segments that can be 
executed in any order but not simultaneously. Exclusive parallelism is 
only profitably exploited within the environment of concurrent 
parallelism. 

Subsequent sections expand on these ideas. 


6.2 SYNCHRONIZATION 


Synchronization is the process of bringing two or more tasks to a known 
stage in their execution. The location in each task where this happens 
is called a synchronization point. Synchronization is required to ensure 
that dependencies are satisfied. Frequently, this means guaranteeing 
that variables in one task are computed before they are used in another 
task. 

Synchronization is a cooperative process among tasks. Certain variables, 
such as event variables, must be accessible to all tasks, and each task 
must execute proper, coordinated multitasking calls. 

The initiation and completion of a task are synchronization points, as 
shown in the following example: 


                  Task 0                        Task 1

                  initialize a, b
  SYNCH POINT ->  CALL TSKSTART        ---->
                  1/2 work (a)                  1/2 work (b)
                  CALL TSKWAIT
  SYNCH POINT ->                       <----
                  results computed
                  in (b) needed here


Management of concurrent parallelism uses events for synchronization. 
Tasks agree on which events signal the beginning and end of requested 
work, as in the following example: 

                  Task 0                        Task 1

                  CALL TSKSTART        ---->
                                                CALL EVWAIT(EV1)
  SYNCH POINT ->  CALL EVPOST(EV1)     ---->
                                                CALL EVCLEAR(EV1)
                  1/2 work (a)                  1/2 work (b)
                  CALL EVWAIT(EV2)              CALL EVPOST(EV2)
  SYNCH POINT ->                       <----
                  CALL EVCLEAR(EV2)             CALL EVWAIT(EV1)
                  sequential work (c)
  SYNCH POINT ->  CALL EVPOST(EV1)     ---->
                                                CALL EVCLEAR(EV1)
                  1/2 work (d)                  1/2 work (e)
                  CALL EVWAIT(EV2)              CALL EVPOST(EV2)
  SYNCH POINT ->                       <----
                  CALL EVCLEAR(EV2)             CALL EVWAIT(EV1)


Task 0 uses event EV1 to synchronize these two tasks by signalling task 1 
that any initialization for work (b) is complete. Task 1 uses event EV1 
to synchronize these two tasks by waiting for it to be posted before 
beginning work (b). 

In a similar fashion, both tasks use event EV2 to synchronize the 
completion of work (a) and (b) before the start of work (c) . 


6.3 COMMUNICATION 

Occasionally, one task must communicate a variable value to other tasks 
while all are executing. To guarantee that the value is computed in one 
task before it is used in another, the communication must take place at a 
synchronization point. The communicating tasks must agree on the 
location of the shared value. One task computes the value before the 
synchronization point, and the other tasks reference the value only after 
the synchronization point. 

In the following program, the main task uses the shared variable JOB to 
indicate the computations to be executed by the subordinate task, T. 
Task T stops when JOB equals 3. 


      PROGRAM MAIN
      COMMON /MT/ ISTART, IDONE, JOB, A(1000), B(1000), C(1000)
      ... call EVASGN, etc.
      CALL TSKSTART ( IDTASK, T )
      JOB = 1
      CALL EVPOST ( ISTART )
      DO 10 I = 1, 500
      B(I) = A(I) + 1.0
   10 CONTINUE
      CALL EVWAIT ( IDONE )
      CALL EVCLEAR ( IDONE )
      JOB = 2
      CALL EVPOST ( ISTART )
      DO 20 I = 501, 1000
      C(I) = B(I) + 2.0
   20 CONTINUE
      CALL EVWAIT ( IDONE )
      CALL EVCLEAR ( IDONE )
      JOB = 3
      CALL EVPOST ( ISTART )
      CALL TSKWAIT ( IDTASK )
      STOP
      END

      SUBROUTINE T
      COMMON /MT/ ISTART, IDONE, JOB, A(1000), B(1000), C(1000)
    1 CALL EVWAIT ( ISTART )
      CALL EVCLEAR ( ISTART )
      IF ( JOB.EQ.2 ) GO TO 19
      IF ( JOB.GT.2 ) GO TO 99
      DO 10 I = 501, 1000
      B(I) = A(I) + 1.0
   10 CONTINUE
      CALL EVPOST ( IDONE )
      GO TO 1
   19 CONTINUE
      DO 20 I = 1, 500
      C(I) = B(I) + 2.0
   20 CONTINUE
      CALL EVPOST ( IDONE )
      GO TO 1
   99 CONTINUE
      RETURN
      END

The integrity of the variable JOB is guaranteed because the programmer 
has defined and followed a rule by which the main task only references 
JOB after IDONE is posted and before ISTART is posted, and task T only 
references JOB after ISTART is posted and before IDONE is posted. 
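
The same protocol can be sketched in Python, used here only as illustrative notation. `threading.Event` stands in for the event variables, with `set`, `wait`, and `clear` playing the roles of EVPOST, EVWAIT, and EVCLEAR; this mapping and all names in the sketch are assumptions, not part of the multitasking library.

```python
import threading

def run_jobs(a):
    n = len(a)
    half = n // 2
    b = [0.0] * n
    c = [0.0] * n
    istart = threading.Event()        # stands in for event ISTART
    idone = threading.Event()         # stands in for event IDONE
    shared = {"job": 0}               # stands in for the shared variable JOB

    def worker():
        while True:
            istart.wait()             # EVWAIT(ISTART)
            istart.clear()            # EVCLEAR(ISTART)
            job = shared["job"]       # read JOB only between ISTART and IDONE
            if job == 1:
                for i in range(half, n):
                    b[i] = a[i] + 1.0
            elif job == 2:
                for i in range(half):
                    c[i] = b[i] + 2.0
            else:
                return                # JOB = 3: the task stops
            idone.set()               # EVPOST(IDONE)

    t = threading.Thread(target=worker)
    t.start()
    for job, lo, hi in ((1, 0, half), (2, half, n)):
        shared["job"] = job           # write JOB only before posting ISTART
        istart.set()                  # EVPOST(ISTART)
        for i in range(lo, hi):       # the main task's half of the work
            if job == 1:
                b[i] = a[i] + 1.0
            else:
                c[i] = b[i] + 2.0
        idone.wait()                  # EVWAIT(IDONE)
        idone.clear()                 # EVCLEAR(IDONE)
    shared["job"] = 3
    istart.set()
    t.join()                          # corresponds to TSKWAIT
    return b, c
```

The main task writes `shared["job"]` only while the worker is blocked in `istart.wait()`, and the worker reads it only after being released, which is the rule stated above.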


6.4 MONITOR 


Certain program constructs have data and storage dependencies that at 
first appear to prevent parallel processing. These constructs involve 
the updating of a variable using an operation that is both commutative 
and associative, for example addition or multiplication. This is the 
case of exclusive parallelism, and it can be executed in parallel if an 
update of a variable can be guaranteed not to interfere with any other 
update of the same variable. This can be guaranteed if the updates are 
monitored to ensure that only one update is ever executing at a given 
time. 


      DO 20 I = 1, N
      DO 10 J = 1, N
      A(I,J) = B(I)*C(J)
   10 CONTINUE
      S = S + SIN(A(I,I))
   20 CONTINUE

In the above example, the iterations of the 20 loop are data and storage 
dependent because of the variable S. The dependence causes a problem if 
the updates to S are attempted in different tasks. Within a given task, 
fetching S may not obtain the correct value if the other task is in the 
update process. Simultaneous updates may overwrite and lose a needed 
value. 

This problem can be circumvented and the iterations of the 20 loop 
executed in parallel if the updates are never simultaneous. Two 
solutions are possible: 

• Use the LOCK facilities to form a critical region around the 
problem code. This alternative is described in detail later in 
this section. 

• Recognize that only the last value, rather than intermediate 
values, of S is required. The multitasking of reduction 
constructs, such as summation, is also described later in this 
section. 
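
The first alternative, a critical region around the update, can be sketched as follows. Python is used here only as illustrative notation; `threading.Lock` plays the role that the LOCK facilities play in the multitasking library, and the function name and the way the work is split are assumptions.

```python
import math
import threading

def locked_sum(values, ntasks=2):
    total = [0.0]                        # the shared variable being updated
    lock = threading.Lock()              # guards every update of total[0]

    def task(lo, hi):
        for i in range(lo, hi):
            term = math.sin(values[i])   # independent work, outside the lock
            with lock:                   # critical region: one update at a time
                total[0] += term

    # Split the index range into ntasks contiguous partitions.
    bounds = [len(values) * k // ntasks for k in range(ntasks + 1)]
    threads = [threading.Thread(target=task, args=(bounds[k], bounds[k + 1]))
               for k in range(ntasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Only the update itself is inside the critical region, so the tasks still overlap on the independent work. Because the order of the updates can differ from run to run, the low-order bits of a floating-point sum may differ as well.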





6.5 PRIVATE, SHARED, AND COMMON VARIABLES 


To guarantee independence, especially storage independence, the use of 
the variables in the code to be multitasked must be analyzed. The 
results of the analysis help make a deliberate allocation of variables 
according to their use. 

The allocation of variables in the original program may conflict with the 
multitasking use. Modifications made for multitasking may affect 
portions of the program not being multitasked. The allocation of 
variables for use in multitasking is one of the most important steps in 
conversion and is easily overlooked. Failure to address this aspect of 
multitasking causes subtle bugs that can be difficult to identify. 

Variables used in a multitasked segment of code can be categorized as 
follows according to the way in which they are used by the tasks that 
have access to them: 


Common Accessible to all tasks, with assignment references only to 
independent storage locations 

Shared Accessible to all tasks, fetch references only, or 
monitored assignment 

Private Accessible to one task, with no restrictions on use but 
always defined before used 

Occasionally, we use the term global to refer to common or shared and 
the term local to refer to private. These terms relate to the 
accessibility of variables by tasks. 

To categorize the variables in a program and identify the allocation 
required for multitasking, the ramifications of introducing a new scope 
boundary in the original code must be analyzed. 

Consider the following program segment: 


     Original code                   Multitasked code

        INPUTS                     INPUTS          INPUTS

      | WORK A |      ---->      | WORK B |      | WORK C |

        OUTPUTS                    OUTPUTS         OUTPUTS

In this example, work A is considered for multitasking by division into 
parts B and C. The variables in work A have relationships (scope and 
dependencies) with the inputs and outputs to work A. These relationships 
must be accommodated in the multitasking code. For example, the inputs 
to A may be required by the task executing work C, and the outputs of 
work C may be required by the code following work A. In addition, the 
tasks executing work B and C may need individual copies of single 
variables appearing in work A. 

Consider the following example: 

Original code

      DO 10 I = 1, 1000
         A = B + FLOAT(I)
         C = C + 1.
         D(I) = A*2.
   10 CONTINUE

The analysis showing that the iterations of the 10 loop can be executed 
in parallel relies on the analysis of the multitasking use of the 
variables referenced in the loop. An attempt to multitask this loop 
introduces a new scope boundary at DO 10 and CONTINUE. The minimum block 
scope of the variables referenced in the loop is identified and compared 
with this new scope region. If the minimum block scope of a variable is 
contained within the introduced scope boundaries, then that variable is 
local (private) to each task. Private variables are always defined 
(assigned) before they are used within the new scope boundaries. The 
variable A in the example is a private variable. 

If the minimum block scope of a variable extends outside of the new scope 
boundaries, then the variable is global (shared or common) to all tasks. 
The program must be modified to maintain the variable's scope over the 
new multitasked code. Usually, this involves putting the variable into 
COMMON statements or into the argument list of the TSKSTART statement. 

The variables B, C, and D in the previous example are global variables. 
The variable B is a shared variable that has only fetch references. 
Variable C is a shared variable that must be monitored to avoid 
simultaneous updates. Variable D is a common variable whose elements are 
independently assigned. 

The original code can be converted for multitasking as follows:

Multitasked code

      COMMON /MT/ LOCKC, B, C, D(1000)
         .
         .
         .
      CALL TSKSTART (IDTASK, TASK)
      DO 10 I = 1, 500
         A = B + FLOAT(I)
         CALL LOCKON (LOCKC)
         C = C + 1.
         CALL LOCKOFF (LOCKC)
         D(I) = A*2.
   10 CONTINUE
      CALL TSKWAIT (IDTASK)


      SUBROUTINE TASK
      COMMON /MT/ LOCKC, B, C, D(1000)
      DO 10 I = 501, 1000
         A = B + FLOAT(I)
         CALL LOCKON (LOCKC)
         C = C + 1.
         CALL LOCKOFF (LOCKC)
         D(I) = A*2.
   10 CONTINUE
      RETURN
      END

Two variables named A now exist in the above example, one in MAIN and one 
in TASK. The storage location B is only fetched and need only be made 
accessible to all tasks. The storage location C is both fetched and 
assigned; it must be monitored as well as made accessible to all tasks. 
Different storage locations of array D are assigned by each task, but the 
whole array is accessible to all tasks. 

Modifications for multitasking (for example, putting variables into
common blocks) may interfere with the storage assignment of variables in
the original program. In the above example, variable A may originally
have been contained in a common block so that several program units
could access (or reuse) it. For multitasking, A must instead be private
to each task.

Likewise, variables B, C, and D may have been contained in common blocks 
along with other variables not involved with multitasking. Also, the 
placement of these variables into common blocks may interfere with the 
use of these variables (or other variables with the same name) in other 
parts of the program. The use of all variables involved in multitasking 
(and other variables with the same name) must be understood completely 
over the whole program. Special attention must be paid to variables that 
are equivalenced to other variables.

The following rules aid in determining the categories of variables
appearing within a loop considered for multitasking. The loop control
variable is assumed to be I.

Type of variable                    Category

Variable subscripted by I           GLOBAL (common)

Variable not subscripted by I       LOCAL (private). Watch out for
and appears only on the left        live variable after loop.
side of an assignment

Variable not subscripted by I       GLOBAL (shared)
and appears only on the right
side of an assignment

Variable not subscripted by I,      LOCAL (private)
appears on both left and right
hand side, and is always
defined before use

Variable not subscripted by I,      ERROR or GLOBAL (shared and
appears on both left and right      monitored)
hand side, and is not defined
before use

The following characteristics of local and global variables should be 
considered when identifying the multitasking use of variables. 

Characteristics of local variables in relation to tasks: 

• Multiple copies (one per task) 

• Temporary existence (dies when task dies) 

• Cannot be referenced by other tasks (private)

• Always defined before used within task 

• Usually scalars or small workspace arrays 

Characteristics of global variables in relation to tasks: 


• One copy (independent of number of tasks) 

• Permanent existence (die when job dies) 

• Can be referenced by all tasks (shared or common) 

• Fetch only, independently used, or monitored 

• Usually larger arrays, lock or event variables


6.6 TASK COMMON


Conventional FORTRAN provides COMMON blocks for sharing variables among
program units. COMMON blocks are also used to minimize the length of
argument lists in CALL statements.

Variables appearing in COMMON blocks occupy static storage locations for 
the life of the program. The global sharing characteristic of COMMON may 
conflict with the necessity for tasks to have private variables. Tasks 
may need to share certain variables among the program units that comprise 
the task but require these variables to be inaccessible from any other 
task. The fact that a particular program unit may be executed by either 
task compounds the problem. 


NOTE 

A CFT language extension is planned for the COS/CFT 
1.14 release that explicitly provides a task common 
capability. The examples in this section present 
techniques that can be used to effect a similar 
capability with standard FORTRAN. 


The following program illustrates the problem. 

      COMMON /ARGS/ A(100), B(100)
      COMMON /RESULT/ C(100)

      DO 20 I = 1, 100
         DO 10 J = 1, 100
            A(J) = I+J
            B(J) = I*J
   10    CONTINUE
         CALL SUB (I)
   20 CONTINUE
      STOP
      END

      SUBROUTINE SUB (I)
      COMMON /ARGS/ A(100), B(100)
      COMMON /RESULT/ C(100)
      C(I) = SDOT(100,A,1,B,1)
      RETURN
      END


This program may be incorrectly converted for multitasking as follows:

      COMMON /ARGS/ A(100), B(100)
      COMMON /RESULT/ C(100)

      CALL TSKSTART (IDTASK, TASK)
      DO 20 I = 1, 50
         DO 10 J = 1, 100
            A(J) = I+J
            B(J) = I*J
   10    CONTINUE
         CALL SUB (I)
   20 CONTINUE
      CALL TSKWAIT (IDTASK)
      STOP
      END

      SUBROUTINE TASK
      COMMON /ARGS/ A(100), B(100)
      COMMON /RESULT/ C(100)
      DO 20 I = 51, 100
         DO 10 J = 1, 100
            A(J) = I+J
            B(J) = I*J
   10    CONTINUE
         CALL SUB (I)
   20 CONTINUE
      RETURN
      END

      SUBROUTINE SUB (I)
      COMMON /ARGS/ A(100), B(100)
      COMMON /RESULT/ C(100)
      C(I) = SDOT(100,A,1,B,1)
      RETURN
      END


The program is incorrect because the storage independence of A and B 
required for multitasking is not provided. The main program and the 
generated task each want to share their own private data values of A and 
B with the subroutine SUB. Two copies of A and B are needed. 

Several multitasking programming techniques provide for task-private 
sharing of variables. These techniques do one of the following: 

• Statically allocate copies of the original storage and employ 
various methods to communicate which portion belongs to each task. 

• Dynamically allocate local variables upon subprogram entry to 
cause duplicate copies to be created. 

To correct the program in the previous example, we can increase the 
extents of A and B and use an offset variable to cause each task to 
access distinct memory locations.

      COMMON /ARGS/ A(200), B(200)
      COMMON /RESULT/ C(100)

      CALL TSKSTART (IDTASK, TASK)
      IOFFSET = 0
      DO 20 I = 1, 50
         DO 10 J = 1, 100
            A(J+IOFFSET) = I+J
            B(J+IOFFSET) = I*J
   10    CONTINUE
         CALL SUB (I, IOFFSET)
   20 CONTINUE
      CALL TSKWAIT (IDTASK)
      STOP
      END

      SUBROUTINE TASK
      COMMON /ARGS/ A(200), B(200)
      COMMON /RESULT/ C(100)
      IOFFSET = 100
      DO 20 I = 51, 100
         DO 10 J = 1, 100
            A(J+IOFFSET) = I+J
            B(J+IOFFSET) = I*J
   10    CONTINUE
         CALL SUB (I, IOFFSET)
   20 CONTINUE
      RETURN
      END

      SUBROUTINE SUB (I, IOFFSET)
      COMMON /ARGS/ A(200), B(200)
      COMMON /RESULT/ C(100)
      C(I) = SDOT(100,A(1+IOFFSET),1,B(1+IOFFSET),1)
      RETURN
      END

Alternatively, A and B can be made local variables to MAIN and TASK. 

      DIMENSION A(100), B(100)
      COMMON /RESULT/ C(100)

      CALL TSKSTART (IDTASK, TASK)
      DO 20 I = 1, 50
         DO 10 J = 1, 100
            A(J) = I+J
            B(J) = I*J
   10    CONTINUE
         CALL SUB (I, A, B)
   20 CONTINUE
      CALL TSKWAIT (IDTASK)
      STOP
      END

      SUBROUTINE TASK
      DIMENSION A(100), B(100)
      COMMON /RESULT/ C(100)
      DO 20 I = 51, 100
         DO 10 J = 1, 100
            A(J) = I+J
            B(J) = I*J
   10    CONTINUE
         CALL SUB (I, A, B)
   20 CONTINUE
      RETURN
      END

      SUBROUTINE SUB (I, A, B)
      DIMENSION A(100), B(100)
      COMMON /RESULT/ C(100)
      C(I) = SDOT(100,A,1,B,1)
      RETURN
      END

The method employed should give attention to the effect that a 
reorganization of data structures and argument lists may have on other 
parts of the program. 


6.7 DOALL


A DOALL is a loop with independent iterations. A partition of the 
iterations of a DO-loop divides the iterations into groups. Each group 
can be executed by a task executing in parallel. 

The partition of iterations is called static if the iterations belonging 
to each group are known before execution time. The partition of 
iterations is called dynamic if either the number of iterations belonging 
to a group or the assignment of iterations to groups is not known until 
execution time. 

For vectorization, most of the work is found in loops. This is where
parallelism is exploited. This is also true of multitasking, but on a
larger scale. The most frequent application of multitasking is the
simultaneous execution of independent iterations of loops. The
techniques for multitasking loops are of fundamental importance.

The first case considered is a loop where each iteration has equal
computational requirements. A choice exists as to how these independent
iterations should be grouped for execution as distinct tasks. At one
extreme, each iteration can be computed by a separate task. At the other
extreme, the iterations can be grouped into a number of tasks equal to
the number of processors. Having fewer groups (tasks) than processors
prevents full utilization of the processor resources. In the middle, a
balance of the number of groups and the number of iterations per group
can enhance both vectorization and multitasking speedups in some cases.

For equal iteration workloads and a given number of processors (for 
example, two) , statically dividing the iterations into two groups is 
natural. Each group may comprise the even or odd numbered iterations or 
the first and second half. The even-odd partition may increase bank 
conflicts. Consider the following matrix addition example: 

      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = B(I,J) + C(I,J)
   10    CONTINUE
   20 CONTINUE

The 20 loop is multitasked by partitioning the range of J into first half 
and second half. This corresponds to partitioning the matrix addition 
problem into two parts: the left half of A and the right half of A. For 
a sufficiently large N, the right half computation may be formed into a 
separate task. 

Task 0

      COMMON /MT/ A, B, C, N
      L = N/2
      LP1 = L + 1
      CALL TSKSTART (IDT, T, LP1, N)
      CALL T (1, L)
      CALL TSKWAIT (IDT)

Task 1

      SUBROUTINE T (IS, IE)
      COMMON /MT/ A, B, C, N
      DO 20 J = IS, IE
         DO 10 I = 1, N
            A(I,J) = B(I,J) + C(I,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END

The same subroutine T is part of both tasks in this example. Its 
arguments determine which half of the computation is to be performed. 
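
The same idea extends to more than two groups. The following sketch, which is an illustration rather than part of the multitasking library, computes the first and last iteration of group K when N iterations are divided statically into NT contiguous groups whose sizes differ by at most one. The names PARTDM, BOUNDS, NBASE, and NEXTRA are hypothetical; the resulting IS and IE could be passed to a subroutine such as T above.

```fortran
      PROGRAM PARTDM
C     Hypothetical driver: print the bounds of each group when
C     N iterations are divided statically among NT groups.
      INTEGER N, NT, K, IS, IE
      N = 10
      NT = 3
      DO 10 K = 1, NT
         CALL BOUNDS (N, NT, K, IS, IE)
         PRINT *, 'GROUP', K, ' ITERATIONS', IS, ' TO', IE
   10 CONTINUE
      STOP
      END

      SUBROUTINE BOUNDS (N, NT, K, IS, IE)
C     Contiguous static partition.  The first MOD(N,NT) groups
C     receive one extra iteration, so that group sizes differ
C     by at most one.
      INTEGER N, NT, K, IS, IE, NBASE, NEXTRA
      NBASE = N/NT
      NEXTRA = MOD(N, NT)
      IS = (K-1)*NBASE + MIN(K-1, NEXTRA) + 1
      IE = IS + NBASE - 1
      IF (K .LE. NEXTRA) IE = IE + 1
      RETURN
      END
```

With N = 10 and NT = 3, the groups are iterations 1 to 4, 5 to 7, and 8 to 10.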

The overhead of task generation (for example, stack allocation of local
variables) usually makes events a better mechanism for managing loop
parallelism. This is especially true when several loops can be
multitasked, as the following example illustrates:

      PROGRAM MAIN
      COMMON /MT/ ISTART, IDONE, JOB, A(1000), B(1000), C(1000)
      CALL TSKSTART (IDTASK, T)
      JOB = 1
      CALL EVPOST (ISTART)
      DO 10 I = 1, 500
         B(I) = A(I) + 1.0
   10 CONTINUE
      CALL EVWAIT (IDONE)
      CALL EVCLEAR (IDONE)
      JOB = 2
      CALL EVPOST (ISTART)
      DO 20 I = 501, 1000
         C(I) = B(I) + 2.0
   20 CONTINUE
      CALL EVWAIT (IDONE)
      CALL EVCLEAR (IDONE)
      JOB = 3
      CALL EVPOST (ISTART)
      CALL TSKWAIT (IDTASK)
      STOP
      END

      SUBROUTINE T
      COMMON /MT/ ISTART, IDONE, JOB, A(1000), B(1000), C(1000)
    1 CALL EVWAIT (ISTART)
      CALL EVCLEAR (ISTART)
      IF (JOB.EQ.2) GO TO 19
      IF (JOB.GT.2) GO TO 99
      DO 10 I = 501, 1000
         B(I) = A(I) + 1.0
   10 CONTINUE
      CALL EVPOST (IDONE)
      GO TO 1
   19 CONTINUE
      DO 20 I = 1, 500
         C(I) = B(I) + 2.0
   20 CONTINUE
      CALL EVPOST (IDONE)
      GO TO 1
   99 CONTINUE
      RETURN
      END

In the above example, the event ISTART signals the waiting task T to 
begin work specified by the flag JOB. The event IDONE signals the main 
task that work has been completed. Task T is programmed as an infinite 
loop always going back to statement 1 to look for more work to do. A 
flag value of 3 terminates task T.

When the workload of each iteration is variable, a different technique may
be appropriate. A static partition of the iterations may result in
unequal execution times for each task, causing some tasks to wait
unnecessarily for other tasks to complete. One solution is to have the
iterations schedule themselves as shown below:

Task 0

      COMMON /MT/ A(100,100), J, JLOCK, N
      J = 0
      CALL TSKSTART (IDT, T)
      CALL T
      CALL TSKWAIT (IDT)

Task 1

      SUBROUTINE T
      COMMON /MT/ A(100,100), J, JLOCK, N
    1 CALL LOCKON (JLOCK)
      JLOCAL = J + 1
      J = JLOCAL
      CALL LOCKOFF (JLOCK)
      IF (JLOCAL.GT.N) GO TO 99
      IF (A(1,JLOCAL).EQ.0.0) GO TO 20
      DO 10 I = 1, N
         A(I,JLOCAL) = A(I,JLOCAL)/A(1,JLOCAL)
   10 CONTINUE
   20 GO TO 1
   99 RETURN
      END

Subroutine T is a part of both tasks in this example. Each task accesses
and updates a shared variable J, which is the outer loop variable. This
access is under the protection of a lock to guarantee exclusive updates.
Each task copies the next value of J to a private location, JLOCAL, and
commits itself to that iteration. When an iteration completes, each task
goes back to look for unprocessed iterations until all are performed.

Which task will compute a given iteration is not known at the start.

Using this technique, the workload tends to be balanced among the tasks.
The task that commits to shorter iterations will do more of them. The
difference in completion time between the two tasks is at most one
iteration time.

The dynamic partitioning technique incurs an overhead for each iteration,
raising the question of whether the improved workload balance compensates
for the overhead. An alternative is to dynamically schedule fixed size
groups of iterations. A task then commits to a range of the values of J.
This reduces the overhead but lessens the load balancing capabilities.
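
Scheduling fixed size groups of iterations can be sketched as a variant of subroutine T. This is an illustration under stated assumptions, not a library facility: TCHUNK, JCHUNK, JFIRST, JLAST, and WORK are hypothetical names, JCHUNK is assumed to be set by the main task before TSKSTART is called, and WORK stands for the body of one iteration. LOCKON and LOCKOFF are used as in the example above, so the sketch requires the multitasking library.

```fortran
      SUBROUTINE TCHUNK
C     Sketch only.  Each pass through the critical region
C     commits the task to JCHUNK consecutive values of J, so
C     the lock is acquired once per group rather than once per
C     iteration.  J is assumed to start at 0.
      COMMON /MT/ A(100,100), J, JLOCK, N, JCHUNK
    1 CALL LOCKON (JLOCK)
      JFIRST = J + 1
      J = J + JCHUNK
      CALL LOCKOFF (JLOCK)
      IF (JFIRST .GT. N) GO TO 99
C     The last group may be shorter than JCHUNK.
      JLAST = MIN(JFIRST + JCHUNK - 1, N)
      DO 20 JLOCAL = JFIRST, JLAST
C        WORK is a hypothetical routine that performs one
C        iteration for column JLOCAL.
         CALL WORK (JLOCAL)
   20 CONTINUE
      GO TO 1
   99 RETURN
      END
```

A larger JCHUNK lowers the number of lock acquisitions, but the difference in completion time between tasks can then be as large as one group of iterations rather than one iteration.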
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6.8 COBEGIN 


A COBEGIN is a sequence of independent program segments. These segments 
may be loops or CALL statements. Because the segments are independent, 
profitable multitasking is possible if the segments are of similar size. 


Example:

      CALL FFT (IN1, OUT1, TEMP1, N)
      CALL FFT (IN2, OUT2, TEMP2, N)

The independence of input, output, and work space allows these FFTs to be
done in separate tasks. Multitasking speeds up the computation in this
case, while vectorization across the FFTs does not help.

COBEGIN can be viewed as a generalization of DOALL with independent
segments instead of independent iterations. The relationship is made
clearer by transforming the segments into a loop.

      DO 10 I = 1, 2
         IF (I.EQ.1) CALL FFT (IN1, OUT1, TEMP1, N)
         IF (I.EQ.2) CALL FFT (IN2, OUT2, TEMP2, N)
   10 CONTINUE

This example is easily converted for multitasking.

      CALL TSKSTART (IDFFT, FFT, IN1, OUT1, TEMP1, N)
      CALL FFT (IN2, OUT2, TEMP2, N)
      CALL TSKWAIT (IDFFT)


6.9 DOPIPE

A DOPIPE is a software pipeline of program segments within a loop.
Dependencies among the segments prevent the loop from having independent
iterations. Nevertheless, the iterations of the loop can be executed in
parallel if the dependencies are satisfied.


Example:

      DO 10 I = 2, N
         A(I) = A(I-1) + B(I)          <-- Statement 1
         D(I) = A(I) + C(I)            <-- Statement 2
   10 CONTINUE

In this example, dependencies involving the variable A prohibit the
iterations of the loop from being performed independently. However, once
the value is assigned to A in one iteration, the next iteration can
begin. Figure 6-1 illustrates how the iterations are pipelined to
satisfy the dependence requirement and to allow simultaneous execution of
the iterations.


              Task 0              Task 1

    Time      STMT 1 (I=2)        WAIT
      |       STMT 2 (I=2)        STMT 1 (I=3)
      |       STMT 1 (I=4)        STMT 2 (I=3)
      v       STMT 2 (I=4)        STMT 1 (I=5)
                   .              STMT 2 (I=5)
                   .                   .

                  Figure 6-1. Pipelining


Task 0

      COMMON /MT/ A, B, C, D, N, EV1, EV2
      CALL TSKSTART (IDT, T)
      DO 10 I = 2, N, 2
         A(I) = A(I-1) + B(I)
         CALL EVWAIT (EV2)
         CALL EVCLEAR (EV2)
         CALL EVPOST (EV1)
         D(I) = A(I) + C(I)
         CALL EVWAIT (EV2)
         CALL EVCLEAR (EV2)
         CALL EVPOST (EV1)
   10 CONTINUE
      CALL TSKWAIT (IDT)

Task 1

      SUBROUTINE T
      COMMON /MT/ A, B, C, D, N, EV1, EV2
      DO 10 I = 3, N, 2
         CALL EVPOST (EV2)
         CALL EVWAIT (EV1)
         CALL EVCLEAR (EV1)
         A(I) = A(I-1) + B(I)
         CALL EVPOST (EV2)
         CALL EVWAIT (EV1)
         CALL EVCLEAR (EV1)
         D(I) = A(I) + C(I)
   10 CONTINUE
      RETURN
      END


A dynamic pipeline can also be used to solve this problem. Locks are 
used to monitor the commitment of tasks to iterations. 

The following example gives an alternate, dynamic implementation: 

Original code:

      DO 10 I = 1, N
         A(I) = A(I-1) + B(I)
   10 D(I) = A(I) + C(I)

Task 0

      COMMON /MT/ A, B, C, D, N, I, L1
      I = 1
      CALL TSKSTART (IDT, T)
      CALL T
      CALL TSKWAIT (IDT)

Task 1

      SUBROUTINE T
      COMMON /MT/ A, B, C, D, N, I, L1
   10 CALL LOCKON (L1)
      IL = I + 1
      I = IL
      IF (IL.GT.N) GO TO 20
      A(IL) = A(IL-1) + B(IL)
      CALL LOCKOFF (L1)
      D(IL) = A(IL) + C(IL)
      GO TO 10
   20 CALL LOCKOFF (L1)
      RETURN
      END


The following example contains two (or more) piping segments. This 
implementation approach is independent of the number of segments or the 
number of processors. 

Original code:

      DO 10 I = 1, N
         A(I) = A(I-1) + B(I)
   10 D(I) = D(I-1) + A(I)

Task 0

      COMMON /MT/ A, B, C, D, N, I, L1, L2
      I = 1
      CALL TSKSTART (IDT, T)
      CALL T
      CALL TSKWAIT (IDT)

Task 1

      SUBROUTINE T
      COMMON /MT/ A, B, C, D, N, I, L1, L2
   10 CALL LOCKON (L1)
      IL = I + 1
      I = IL
      IF (IL.GT.N) GO TO 20
      A(IL) = A(IL-1) + B(IL)
      CALL LOCKON (L2)
      CALL LOCKOFF (L1)
      D(IL) = D(IL-1) + A(IL)
      CALL LOCKOFF (L2)
      GO TO 10
   20 CALL LOCKOFF (L1)
      RETURN
      END

For this technique to be worth the price of the overhead, the program
segments in the pipeline must be of sufficient size. Additionally, each
segment must be of similar size so that the workload for each task is
balanced at every stage.


6.10 CRITICAL REGION 

A critical region is a program segment that can be executed by only one 
task at a given time. The lock facilities guarantee that this happens. 
Locks are not associated directly with any computational variable or its 
storage location; they are rather associated with code that references 
the variable. 

A critical region is formed by turning the lock on before entering the 
program segment and off after leaving. A critical region may be one 
physical program segment or corresponding program segments in different 
tasks. A task attempting to enter a critical region that is occupied by 
another task must wait until the other task exits the region, as shown in 
the following example: 

      Task 0                         Task 1

      CALL LOCKON (LOCK1)
        |                            CALL LOCKON (LOCK1)
      critical region 1                wait
        |                              |
      CALL LOCKOFF (LOCK1)             |
        |                            critical region 1
        |                              |
        |                            CALL LOCKOFF (LOCK1)
        |                              |
        |                            CALL LOCKON (LOCK2)
      CALL LOCKON (LOCK2)              |
        wait                         critical region 2
        |                              |
        |                            CALL LOCKOFF (LOCK2)
      critical region 2
        |
      CALL LOCKOFF (LOCK2)

The following example illustrates the use of locks to manage exclusive
parallelism in the environment of concurrent parallelism. Consider the
original code segment:

      S = 0.0
      DO 20 I = 1, N
         DO 10 J = 1, N
            A(I,J) = B(I)*C(J)
   10    CONTINUE
         S = S + SIN(A(I,1))
   20 CONTINUE


The iterations of the 20 loop can be executed in parallel if the updates 
to S are never performed simultaneously. One lock protects the access of 
all tasks to this update statement. The multitasked code is as follows: 


Task 0

      COMMON /MT/ A, B, C, N, S, LOCKS
      CALL LOCKASGN (LOCKS)
      S = 0.0
      CALL TSKSTART (IDT, T)
      DO 20 I = 1, N/2
         DO 10 J = 1, N
            A(I,J) = B(I)*C(J)
   10    CONTINUE
         CALL LOCKON (LOCKS)
         S = S + SIN(A(I,1))
         CALL LOCKOFF (LOCKS)
   20 CONTINUE
      CALL TSKWAIT (IDT)

Task 1

      SUBROUTINE T
      COMMON /MT/ A, B, C, N, S, LOCKS
      DO 20 I = N/2+1, N
         DO 10 J = 1, N
            A(I,J) = B(I)*C(J)
   10    CONTINUE
         CALL LOCKON (LOCKS)
         S = S + SIN(A(I,1))
         CALL LOCKOFF (LOCKS)
   20 CONTINUE
      RETURN
      END

The following subsection offers an alternate approach to the multitasking
of summation.


6.11 SUMMATION AND OTHER REDUCTION CONSTRUCTS 

Occasionally, exclusive parallelism constructs are found within the 
context of otherwise completely independent code. Such constructs 
include summation, product, minimum, maximum, and search. One 
alternative for executing these constructs in parallel along with the 
rest of the code is presented in the previous section. That technique 
uses locks to maintain the exclusive independence of the construct while 
exploiting a higher level of parallelism.

The technique presented here exploits parallelism in the constructs
themselves. This technique is useful when parallelism is not present at
a higher level or when these constructs form a significant portion of the
work to be done.

Each task forms a partial result (partial sum, local maximum, etc.),
and the partial results are collected to form a total result by the main
task after all tasks have synchronized. This approach does not require
locks, since each task has independent storage locations in which to
compute its partial result. The main task, however, needs access to all
partial results. The following code shows the example of the previous
section using this technique.

Task 0

      COMMON /MT/ A, B, C, N, S1
      S0 = 0.
      S1 = 0.
      CALL TSKSTART (IDT, T)
      DO 20 I = 1, N/2
         DO 10 J = 1, N
            A(I,J) = B(I)*C(J)
   10    CONTINUE
         S0 = S0 + SIN(A(I,1))
   20 CONTINUE
      CALL TSKWAIT (IDT)
      S = S0 + S1

Task 1

      SUBROUTINE T
      COMMON /MT/ A, B, C, N, S1
      DO 20 I = N/2+1, N
         DO 10 J = 1, N
            A(I,J) = B(I)*C(J)
   10    CONTINUE
         S1 = S1 + SIN(A(I,1))
   20 CONTINUE
      RETURN
      END


Here Task 0 computes its partial sum in S0, while Task 1 computes its
partial sum in S1. After Task 1 completes, Task 0 computes the total
sum, S, from S0 and S1.
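
The combining step can be checked serially before any tasks are introduced. The following sketch is an assumption-laden illustration, not part of the example above: the program name REDCHK, the array X, and its test data are hypothetical, and no multitasking library calls are made. It forms partial sums and partial maxima over the two halves of a range, as Task 0 and Task 1 would, and compares the combined results with a single pass.

```fortran
      PROGRAM REDCHK
C     Serial check only; all names are hypothetical.  Partial
C     results formed over two halves of the range are combined
C     and compared with the results of a single pass.
      PARAMETER (N = 1000)
      REAL X(N), S, S0, S1, XMAX, XMAX0, XMAX1
      DO 10 I = 1, N
         X(I) = SIN(FLOAT(I))
   10 CONTINUE
C     Single-pass reference results.
      S = 0.0
      XMAX = X(1)
      DO 20 I = 1, N
         S = S + X(I)
         XMAX = AMAX1(XMAX, X(I))
   20 CONTINUE
C     Partial results as Task 0 would form them.
      S0 = 0.0
      XMAX0 = X(1)
      DO 30 I = 1, N/2
         S0 = S0 + X(I)
         XMAX0 = AMAX1(XMAX0, X(I))
   30 CONTINUE
C     Partial results as Task 1 would form them.
      S1 = 0.0
      XMAX1 = X(N/2+1)
      DO 40 I = N/2+1, N
         S1 = S1 + X(I)
         XMAX1 = AMAX1(XMAX1, X(I))
   40 CONTINUE
C     Combining step, performed by the main task after TSKWAIT
C     in a multitasked version.
      PRINT *, 'SUM DIFFERENCE ', ABS(S - (S0 + S1))
      PRINT *, 'MAX DIFFERENCE ', ABS(XMAX - AMAX1(XMAX0, XMAX1))
      STOP
      END
```

The maximum is reproduced exactly. The sum may differ from the single-pass value in its last bits, because forming partial sums changes the order of the floating-point additions; comparisons between multitasked and original results should allow for this.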


7 PROGRAM ANALYSIS AND CONVERSION


Analyzing and converting programs to use multitasking requires an
understanding of the parallelism concepts presented in previous sections
and a knowledge of the function and correct use of the multitasking
library utilities. This section describes a procedure for finding
parallelism in programs, analyzing the independence requirements, and
finally writing the multitasked code.

The original program is assumed to be debugged and working correctly. 


7.1 CONDITIONAL MULTITASKING 

A powerful debugging technique that should be considered when analyzing a 
program is simply to set up the modified program so that the multitasking 
can easily be turned on or off. This technique is mentioned here because 
it is best implemented from the start rather than after errors have been 
found. If unexpected results are produced, rerunning the same code 
without multitasking helps identify whether or not the problem is related 
to multitasking. The alternative is to try to maintain the 
nonmultitasked (original) program, but this is often difficult. 

The following must be considered when implementing conditional
multitasking:

• A common logical variable must be defined that is set with input 
or with a data statement. This variable controls multitasking, 
and all calls to the multitasking library subroutines should be 
made only if this variable has a value of TRUE. 

• A mechanism must be set up to base the partitioning of the data on 
the use of multitasking. If multitasking is turned off, for 
example, a task must process all the data, not just a fraction. 
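
The two considerations above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a prescribed design: CONDMT, MTON, /MTCTL/, and WORK are hypothetical names, the flag is set by assignment here although it could equally be read from input, and the first element of the task identifier array is assumed to hold the length of the array. TSKSTART and TSKWAIT are used as in the examples of section 6, so the sketch requires the multitasking library.

```fortran
      PROGRAM CONDMT
C     Sketch only.  MTON, /MTCTL/, and WORK are hypothetical.
      LOGICAL MTON
      COMMON /MTCTL/ MTON
      COMMON /MT/ A(1000), B(1000), N
      DIMENSION IDT(2)
      EXTERNAL WORK
      N = 1000
      DO 5 I = 1, N
         A(I) = FLOAT(I)
    5 CONTINUE
C     Controls multitasking; could instead be read from input.
      MTON = .TRUE.
      IF (MTON) THEN
C        Multitasking on: library calls are made and each of
C        the two tasks processes half of the data.
C        IDT(1) is assumed to hold the length of IDT.
         IDT(1) = 2
         L = N/2
         LP1 = L + 1
         CALL TSKSTART (IDT, WORK, LP1, N)
         CALL WORK (1, L)
         CALL TSKWAIT (IDT)
      ELSE
C        Multitasking off: no library calls are made and one
C        task processes all the data.
         CALL WORK (1, N)
      ENDIF
      STOP
      END

      SUBROUTINE WORK (IS, IE)
C     Processes elements IS through IE; the partition is
C     chosen entirely by the caller.
      COMMON /MT/ A(1000), B(1000), N
      DO 10 I = IS, IE
         B(I) = A(I) + 1.0
   10 CONTINUE
      RETURN
      END
```

Because WORK takes its bounds as arguments, the same subroutine serves both settings of the flag, and only the caller needs to test it.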

The design of the program will dictate how this is best done.


7.2 LOCATING POTENTIAL PARALLELISM

The goal of the first analysis step is to understand the program flow.
The purpose is to locate portions of the program that have the potential
for multitasking. The potential exists if loops (potential DOALLs) are
identified, and if the granularity of work is sufficient to consider
multitasking.

The following stages comprise this step:

Stage 1: Identify the time consuming routines of the program. The
greatest program performance improvement can be made only if
most of the work is multitasked. The CFT FLOWTRACE (ON=F)
option should be used and the resulting program executed on one
CPU to obtain this information.

Stage 2: Form a static calling tree to better understand the call
relationships among the subroutines. Use the option TREE=FULL
with the CBXRF utility (see section 8) to obtain a static
calling tree listing.

Stage 3: Form a dynamic calling tree (using the output of FLOWTRACE) for 
those parts of the static calling tree containing the time 
consuming subroutines. A dynamic calling tree shows the 
looping structures contained in and containing the time 
consuming subroutines. The statement label use table produced 
by CFT can aid in the location of DO and IF loops. 

Stage 4: Identify loops with sufficient work granularity to be 
considered for multitasking. Use one millisecond as a 
guideline. The FLOWTRACE output is an aid in the estimation of 
work size for loops containing calls to subroutines, but it may 
also help in estimating work granularity of loops within 
subroutines if loop bounds are known. 

Stage 5: Eliminate loops with obvious data or control dependencies that 
prohibit them from being multitasked. The outer time step loop 
is an example in which results computed on one iteration are 
inputs for the next iteration. 

Stage 6: Upon reaching this stage, a nested set of loops (possibly
containing calls to subroutines containing lower level loop
nests) or a collection of nested sets have been identified for
further analysis. For each nested set of loops, choose the
outermost loop for multitasking consideration and proceed to
step 2 in the next section.


7.3 VERIFYING AND CREATING INDEPENDENCE

In the second step, the goal is to understand the use of variables 
referenced in each of the nested sets of loops identified in step 1. The 
purpose is to verify that computational, storage, and temporal 
independence are present or that the program can be modified to create 
the independence required for multitasking. 

The stages in understanding the use of variables are as follows: 

Stage 1: With the multitasking model in mind (DOALL) , identify the new 
scope boundary with the loop being considered for multitasking. 

Stage 2: Record all variables referenced within the new scope boundary.
Use the CB=FULL option with the CBXRF utility to analyze the
use of COMMON block variables.

Stage 3: Determine the computational independence of each variable by 
assessing its data independence for all control paths through 
the iterations. Ask the questions, "Is the variable produced 
by another iteration?" and, "Is the variable used by another 
iteration?" The answers must be "no", either now or after 
stage 5. 

Stage 4: Evaluate each variable for storage independence according to 
its multitasking use. Will the variable be common or shared 
(one copy), or will it be private (one copy for each task)? 
Compare the minimum scope of each variable with the new scope 
boundary introduced by multitasking. Take care to follow the 
spread of scope through COMMON and argument lists. Watch out 
for EQUIVALENCES. 

Stage 5: Modify the old code with the new scope required for
multitasking. Place common or shared variables in COMMON
blocks. Remove private variables from COMMON blocks. Pay
attention to the ramifications of these changes on other parts
of the code. Do the changes made for storage independence now
create the computational independence required in stage 3?

Stage 6: Maintain determinism and temporal independence, recognizing the 
possible need to synchronize the tasks at the start and end of 
inner loops. Also note which variables require monitoring. 

Stage 7: The above stages are performed for each set of nested loops.
If a dependence that cannot be removed is found, then go back
to step 1 in the previous subsection, choose the next lower
level DO-loop, and redo step 2 here. If independence is
guaranteed, then proceed to step 3 (writing multitasked code).


7.4 WRITING MULTITASKING CODE


Most of the decisions concerning the multitasking code to write have been 
made during the previous analysis steps. The DOALL model can be used for 
each set of nested loops to form a task, or several sets can be combined 
into one larger task. Static or dynamic partitioning of iterations can 
be chosen. Mechanisms for synchronization are inserted where necessary. 
Locks are chosen for each critical region of code to be protected. The 
storage of variables has been reorganized according to their multitasking 
use. 

A final inspection of the code should check that the multitasking 
mechanisms employed have been properly initialized, that event and lock 
names are shared among the tasks that use them, and that all tasks will 
have completed at job end. The possibility that fewer processors may be 
available than were expected in the original multitasking program design 
should be considered. 

In the examples presented and in the procedure previously outlined, a
particular programming style was employed. For the conversion of
programs to multitasking on p processors, work was partitioned into p
components. Each component was intended to be executed simultaneously by
a separate processor. Synchronization followed each partitioned segment
of work.

This style was employed because it is easier to conceptually manage a 
rigid structure of parallelism. Additionally, it was recognized that the 
programmer is in a better position than the library scheduler to control 
the partitioning and balancing of work among the processors, because only 
the programmer knows the execution time of the tasks started. 

The user is by no means limited to this style. The facilities provided 
enable the generation of many more tasks than processors, and the library 
scheduler allocates the work to be done among the resources available. 
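
The rigid style described above (p components, one per processor, with
synchronization after each partitioned segment) can be sketched in a modern
language. The following is a minimal Python illustration, not CFT code and
not the Cray multitasking library. The names partition and doall are
hypothetical, and Python threads merely stand in for tasks. The sketch shows
the structure of static partitioning only and makes no claim about speedup.

```python
import threading

def partition(n, p):
    """Split iterations 0..n-1 into p contiguous blocks of near-equal size."""
    base, extra = divmod(n, p)
    bounds = []
    start = 0
    for k in range(p):
        size = base + (1 if k < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def doall(body, n, p):
    """Run body(i) for every i in range(n) on p workers, then wait for all."""
    def worker(lo, hi):
        for i in range(lo, hi):
            body(i)
    threads = [threading.Thread(target=worker, args=b) for b in partition(n, p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()              # synchronization follows the partitioned segment

a = [0.0] * 10

def body(i):
    a[i] = 2.0 * i            # each iteration writes only its own element

doall(body, len(a), 4)
```

Generating many more tasks than processors, as the last paragraph allows,
would correspond here to calling partition with a block count larger than
the number of workers and letting a scheduler hand out the blocks.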


7.5 I/O 

The major concern throughout this publication has been the multitasking 
of computations. For some applications, attention to I/O is at least as 
important as the speedup of computation. 

Multitasking I/O is possible on a limited basis. Different tasks can 
perform I/O on different files. I/O on the same file by different tasks 
is limited by the nondeterministic nature of task execution. Parts of 
the I/O support library are critical regions that are protected from 
simultaneous access and thus limit the parallelism that can be exploited. 


8. DEBUGGING


Debugging multitasked programs is difficult. A multitasked program could
fail for any number of reasons, including the following: 

• Errors in arithmetic, coding, or algorithm unrelated to 
multitasking 

• Deadlock between tasks 

• Lack of synchronization between tasks 

• Failure to protect critical regions 

• Violations of data dependence relationships

• Failure to provide for local variables 

8.1 FREQUENT ERRORS


The following list is intended as a checklist for avoiding common errors 
when converting code for multitasking. 

1. Multitasking mechanisms have been properly declared and initialized. 

• Task identifier array is dimensioned 2 or 3. 

• First element of task identifier array is initialized. 

• Subroutine tasks are declared EXTERNAL. 

• All event variables have been assigned before being used.

• All event variables are accessible to tasks that use them.

• Any necessary initialization of events is done.

• All lock variables have been assigned before use.

• All lock variables are accessible to tasks that use them.

• Any necessary initialization of locks is done.

2. Multitasking mechanisms are used correctly.

• Every TSKSTART has a corresponding TSKWAIT.

• Every EVPOST has a corresponding EVWAIT and EVCLEAR.

• Every LOCKON has a corresponding LOCKOFF. 

3. Independence is guaranteed.

• Between synchronization points, tasks do not rely on quantities
computed in other tasks. 

• Except for monitored variables, tasks compute variables stored 
in separate storage locations. 

• The order of execution of tasks is immaterial. Synchronization 
points are used when the order is important. 
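
Several checklist items (locks assigned before use, every LOCKON paired with
a LOCKOFF, every started task waited for, monitored variables updated only
inside a critical region) can be illustrated with a short sketch. The code
below is a Python analogy under stated assumptions, not the Cray library:
threading.Lock stands in for a lock variable, acquire and release stand in
for LOCKON and LOCKOFF, and the names total and add_partial are hypothetical.

```python
import threading

total = 0
total_lock = threading.Lock()      # created before any task uses it

def add_partial(values):
    """Sum a private slice, then fold it into the shared total."""
    global total
    partial = sum(values)          # local work; no shared data is touched
    total_lock.acquire()           # plays the role of LOCKON
    try:
        total += partial           # critical region: the monitored variable
    finally:
        total_lock.release()       # plays the role of LOCKOFF; always reached

chunks = [range(0, 25), range(25, 50), range(50, 75), range(75, 100)]
tasks = [threading.Thread(target=add_partial, args=(c,)) for c in chunks]
for t in tasks:
    t.start()
for t in tasks:
    t.join()                       # every started task is waited for
```

Because each task touches the shared total only inside the protected region,
the result does not depend on the order in which the tasks run.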


8.2 COS TASKS VERSUS USER TASKS

When looking at job outputs from a multitasked job, do not be confused by
COS uses of the term task. In section 4 of this publication, the
concept of the logical CPU was introduced. The references in COS-related
outputs to task refer to a logical CPU and not to the task or tasks
created by the user. The number of COS tasks may be equal to the number
of user tasks, but the user tasks may have been assigned over time to
one, several, or all of the COS tasks.

Output such as the following may lead to this confusion:

• CHARGES. If multiple COS tasks are created during the lifetime of 
a user job, CHARGES generates clock period (CP) time, waiting for 
CP time, and waiting for I/O time values for each task and for the 
total job. 

• DUMP. If multiple COS tasks are active when a DUMPJOB is
executed, the DUMP output shows an Exchange Package for each task,
starting with the task in execution when the DUMPJOB was executed.

• DEBUG. If multiple COS tasks are active when a DUMPJOB is 
executed, DEBUG traces back through each task, starting with the 
task in execution when DUMPJOB was executed. 


8.3 CONDITIONAL MULTITASKING 

Section 7 describes a technique that allows a multitasked program to be 
easily run in a nonmultitasked mode. This allows isolation of some types
of multitasking errors. 


8.4 ELIMINATING COS MULTITASKING


A multitasked program can be run in a single-tasked mode. This is 
accomplished by using the multitasking tuning call (see section 4) to set 
the maximum number of logical CPUs to one. The library scheduler 
continues to execute and processes multiple tasks, but at most one task
is known to COS. 

In effect, this leads to tasks being executed largely in sequential 
order, with task switches only taking place when the executing task 
becomes blocked. The sequence of task execution is reproducible from run 
to run. This procedure can help with the following types of problems: 

• Problems not related to multitasking. The problem should still
occur, but with less complexity surrounding the issue.

• Synchronization problems caused by two tasks simultaneously
entering critical regions. This approach partially, but not
completely, closes this window.
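
The idea of limiting execution to one logical CPU to obtain a reproducible
task sequence can be illustrated by analogy. The sketch below uses Python's
ThreadPoolExecutor with a single worker; it is not the Cray tuning call, and
run_tasks and the task names are hypothetical. One difference is assumed and
should be kept in mind: here each task runs to completion before the next
starts, whereas the library scheduler described above also switches tasks
when the executing task becomes blocked.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks, max_workers):
    """Run (name, function) pairs and return the order in which they finished."""
    finished = []

    def wrapper(name, fn):
        fn()
        finished.append(name)      # list.append is thread-safe in CPython

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(wrapper, name, fn) for name, fn in tasks]
        for future in futures:
            future.result()        # re-raise any error from the task
    return finished

def work():
    sum(range(1000))

tasks = [("MTASK1", work), ("MTASK2", work), ("MTASK3", work)]
first = run_tasks(tasks, max_workers=1)
second = run_tasks(tasks, max_workers=1)
```

With a single worker the two runs finish the tasks in the same order, which
is what makes a failure easier to reproduce and isolate.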


8.5 TIMER TRACE 


A timer trace package is currently available on an unsupported basis from 
Cray Research. The package can be used to record and display the 
sequence of task operations. The package consists of the following 
subroutines: 

• TTINIT to initialize the tracing capability 

• TTRACE to record an event (task number, message, data, time) 

• TTPRINT to print a trace (up to 1024 most recent events) 

• TTCLEAR to flush the previous history and start new history 

The programmer inserts calls to the routines (primarily TTRACE) into the 
code at points of interest. The output from the TTPRINT call is divided 
by task and listed sequentially; it displays the message, data, time, and 
intervals between events. As designed, the package supports multitasking 
of two tasks. 

This package is flexible and can be used to help debug a number of timing 
and synchronization problems. 

For further information on the package or to obtain a copy of it, have 
your Cray Research site analyst contact Field Liaison/Support. 
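
The general shape of such a trace facility can be sketched as follows. This
is a hypothetical Python analogy, not the Cray package: the class TimerTrace
and its methods are invented for illustration and loosely mirror the roles
of TTINIT (the constructor), TTRACE (trace), TTPRINT (report), and TTCLEAR
(clear). The 1024-event limit follows the description above; unlike the
package as described, the sketch does not limit the number of tasks.

```python
import threading
import time
from collections import deque

class TimerTrace:
    def __init__(self, capacity=1024):
        # A bounded deque keeps only the most recent events.
        self._events = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def trace(self, task, message, data=None):
        """Record one event: task number, message, data, and time."""
        stamp = time.perf_counter()
        with self._lock:
            self._events.append((task, message, data, stamp))

    def clear(self):
        """Discard the previous history and start a new one."""
        with self._lock:
            self._events.clear()

    def report(self):
        """Return lines grouped by task, with the interval since the
        previous event of the same task."""
        with self._lock:
            events = list(self._events)
        by_task = {}
        for task, message, data, stamp in events:
            by_task.setdefault(task, []).append((message, data, stamp))
        lines = []
        for task in sorted(by_task):
            previous = None
            for message, data, stamp in by_task[task]:
                interval = 0.0 if previous is None else stamp - previous
                lines.append(f"task {task}: {message} data={data} "
                             f"t={stamp:.6f} dt={interval:.6f}")
                previous = stamp
        return lines

tt = TimerTrace()
tt.trace(1, "start", data=0)
tt.trace(2, "start", data=0)
tt.trace(1, "done", data=42)
lines = tt.report()
```

Calls to trace would be placed at points of interest, such as just before
and after a wait on an event, so that the intervals show where a task spent
its time.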



8.6 CBXRF - COMMON BLOCK CROSS REFERENCE


CBXRF is a general CFT analysis tool that is under development at Cray 
Research. It processes a CFT program and produces the following outputs: 

• For each common block, the names of the modules that use it

• For each common block, a detailed cross reference, by variable, of
the module and the lines that reference it

• For each module, information on its entry points, whom it calls,
who calls it, and common blocks it uses

• A static calling tree, displaying module calls in a graphic manner
through the use of indentation. The tree can be started at any
point, allowing for the display of specific subtrees.

Additional analysis and outputs are planned. 

CBXRF is of particular value for analysis of multitasking because of its
ability to collect and consolidate information, on a global basis, on 
global variables and their use by the subroutines within a CFT 
application. 

This product is planned for the COS 1.14 release, but a prototype is 
available on a prerelease basis. For further information on the product 
or for prerelease information, have your Cray Research site analyst 
contact Field Liaison/Support. 


8.7 FLOWTRACE


The FLOWTRACE package, described in the CFT Reference Manual, Cray 
Research publication SR-0009, is useful when analyzing programs for 
multitasking (see section 7).

Once a program has been multitasked, however, FLOWTRACE is of little 
use. Because control is passed between subroutines by the library 
scheduler, the information presented by FLOWTRACE may be misleading. 

For example, a multitasked program was run in which three tasks each 
executed loops of the same size and time. Events were used to force a 
certain order of execution. The run produced the following FLOWTRACE
output. On the surface, it appears that MTASK1 has little
execution time, although it actually needed more CPU time than the other 
two. This results from the fact that the library scheduler passed 
control back to MTASK1 from the other two tasks without executing the 
FLOWTRACE code. 

FLOW TRACE SUMMARY

     ROUTINE          TIME        %    CALLED   AVERAGE T

  1  MTASK1       0.000194     0.00       1     0.000194
                  CALLS MTASK2 MTASK3

  2  MTASK2       2.934426    54.02       1     2.934426
                  CALLED BY MTASK1

  3  MTASK3       2.497124    45.97       1     2.497124
                  CALLED BY MTASK1

***  TOTAL        5.431744

***  OVERHEAD     0.000064


In another job, however, FLOWTRACE aborted because the order of task execution
led to a subroutine returning without having been entered (as far as FLOWTRACE
was concerned). The program and its multitasking design were correct but
conflicted with assumptions made in the design of FLOWTRACE.


8.8 INTERPRETING TRACEBACKS 

If a multitasked job aborts, the traceback reflects the last entry into the 
multitasking package. The following example demonstrates such a traceback: 

AR004 - BAD SCALAR ARGUMENT TO ARLIB MATH ROUTINE
TB001 - BEGINNING OF TRACEBACK

 - $TRBK     WAS CALLED BY ARERP%    AT 24771b
 - ARERP%    WAS CALLED BY SQRT%     AT 25425d
 - SQRT%     WAS CALLED BY MTASK2    AT 455d
 - MTASK2    WAS CALLED BY TSKSTART  AT 24215d
 - TSKSTART  WAS CALLED BY MTASK1    AT 313b

TB002 - END OF TRACEBACK


8.9 DEADLOCK DETECTION 

The library scheduler detects software deadlock. If one or more tasks exist 
but all are queued for events or locks, the library scheduler recognizes the 
condition and generates a message. The library scheduler does not recognize
a deadlock that involves only a subset of the tasks, so the message may
appear long after the deadlock initially arose.
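
The limitation described above follows directly from the detection rule. A
minimal Python sketch of that rule, assuming a hypothetical table that maps
each task to what it is currently doing, is shown below. It is an
illustration of the condition stated in the text, not the library
scheduler's implementation.

```python
WAITING = {"lock", "event"}

def all_tasks_waiting(states):
    """Return True when at least one task exists and every task is queued
    for a lock or an event (the condition the text says is detected)."""
    return bool(states) and all(state in WAITING for state in states.values())

# Tasks A and B each wait for a lock the other holds; task C still computes.
partial_deadlock = {"A": "lock", "B": "lock", "C": "running"}

# Later, C also blocks on an event that only A or B could post.
full_deadlock = {"A": "lock", "B": "lock", "C": "event"}
```

In the first table A and B are already deadlocked, but the rule reports
nothing because C is still running; the condition is reported only once the
second table is reached.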



9. ADVANCED CONCERNS



9.1 MULTITASKING IN OTHER LANGUAGES 


Multitasking programs may include user software written in CRI-supported 
languages other than FORTRAN. This section addresses some of the areas 
of special concern. 


9.1.1 PASCAL 

Pascal software compiled with Pascal release 1.00 cannot be used in a 
multitasked job. This compiler generates code that uses a stack
mechanism different from that used by the multitasking library 
subroutines. In Pascal release 2.00, the same stack mechanism will be 
used by both types of software. At that point, Pascal code could be 
included in a multitasked job by compiling the code with the Z+ 
(reentrant) option.


9.1.2 CAL 

Software written in CAL can generally be modified to work in a 
multitasking program, but the programmer must take care. 

First, the subroutine should use the new calling sequence introduced with 
the COS 1.12 release. Using ENTER, EXIT, and associated macros is 
recommended. If the CAL subroutine is to be reentrant, it must use the 
stack calling sequence and should access local variables using the LOAD 
and STORE macros. (These macros are described in the Macros and Opdefs 
Reference Manual, CRI publication SR-0012.) A subroutine coded with 
ENTER, EXIT, LOAD, STORE, and associated macros produces stack-based code 
if assembled with a version of $SYSTXT that has the Stack flag set. 

Second, the programmer should ensure that global variables are stored in 
memory before subroutine calls that may lead to a task switch. COMMON 
blocks can be defined within CAL code for consistency with CFT global 
variables. 

Third, care must be taken if I/O is included. The I/O tables (LFTs and 
DSPs), the datasets, and the I/O buffers may require treatment as shared
variables requiring protection. 

Finally, the multitasking library subroutines should be called as
described in the Library Reference Manual, CRI publication SR-0014. 
Register preservation assumptions are as with any library call. 

For examples of CAL subroutines that have been modified to work in a 
multitasking environment, a programmer could look at subroutines in the 
various CRI default libraries ($IOLIB, $UTLIB, $ARLIB, $FTLIB, $SCILIB,
or $SYSLIB).


NOTE 

CAL subroutines used in a multitasking system must not 
use hardware semaphore registers 0 through 16; these
registers are reserved for system use. Direct use of 
cluster registers and shared B and T registers should 
be done with care, since system software and 
CFT-generated code may use these currently or in the 
future. 


9.2 USING COS MULTITASKING MACROS 

The Macros and Opdefs Reference Manual, CRI publication SR-0012,
describes several macros that directly create and delete COS logical CPUs 
from user code. The library scheduler uses these macros. Users can also 
use the macros, but they will be unable to synchronize between tasks 
created in this manner and any tasks created using the library routines 
described in this manual. In general, COS macros and library calls 
should be viewed as mutually exclusive. 


9.3 BATCH USE OF MULTITASKING 

As mentioned in section 1, multitasking is aimed at the dedicated user. 
Jobs can be run in a batch environment, though, and batch can be useful 
for program development and debugging. Some suggestions for more 
effective batch use follow:

• Section 3 includes a discussion of a tuning subroutine, TSKTUNE.
One of the parameters to this routine defines the maximum number
of CPUs that will execute in an idle loop if otherwise unneeded.

If a program tends to keep CPUs unused for long periods of time,
the programmer should consider setting the parameter to zero in a
production batch environment. Otherwise, the CPU will execute the
idle loop and hence be unavailable to other jobs in the system.

• Performance in a batch environment is highly variable. If
performance testing during batch is considered important, a site
could establish a job class for multitasking that assigns a higher
priority to such jobs. However, tasks in such jobs are still
scheduled individually, depending upon their priority and those of
other jobs present in the system. Sites should not introduce such
a job class without carefully considering the impact on system
throughput.


A. MULTITASKING ON A CRAY-1


With the COS 1.13 release, multitasking codes can be run on a CRAY-1 
Computer System through a library simulation of the hardware multitasking 
functions available on a CRAY X-MP. A $UTLIB built for a CRAY-1 system 
generates code that does the following: 

• Limits the maximum number of COS CPUs to one 

• Disables the hardware test and set function 

• Disables the TSKTUNE subroutine (parameter names are checked but 
the values are not changed) 

A program whose source has been modified for multitasking should be 
recompiled and the absolute module rebuilt if transferred between machine 
types. Binaries, especially absolute binaries, are not transportable. 

Multitasked codes that run correctly on a CRAY-1 Computer System execute 
correctly on a CRAY X-MP Computer System, but the converse is not always 
true. For example, a program could be set up in which the 
synchronization between two tasks is by way of common variables in 
memory, and one task loops until a second task has updated the 
variables. Although this is not a recommended design, it could execute 
correctly on a CRAY X-MP Computer System (if the machine is dedicated and 
the update period is long) because COS will time slice the logical CPUs 
assigned to the user job. Over time, both tasks will execute. 

Under the simulation mode, an opportunity may never arise for the library 
scheduler to swap control between the two tasks, since no explicit 
synchronization is performed. Hence one task or the other could retain 
control, either looping forever or updating forever, until the job's time 
limit is exceeded. 
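
The difference between the two designs can be sketched with a small
cooperative scheduler in which control changes hands only at explicit
synchronization points. This is a Python analogy built on generators, not
the library simulation itself. All names are hypothetical, a yield stands in
for a call into the multitasking library, and the spin limit stands in for
the job's time limit.

```python
def run(tasks):
    """Round-robin scheduler: a task gives up control only where it yields."""
    queue = list(tasks)
    while queue:
        task = queue.pop(0)
        try:
            next(task)             # run until the task's next yield
            queue.append(task)
        except StopIteration:
            pass                   # task finished

def updater(shared):
    shared["ready"] = True         # the update the other task is waiting for
    yield

def spinning_waiter(shared, limit):
    spins = 0
    while not shared["ready"] and spins < limit:
        spins += 1                 # no yield: control is never released
    shared["outcome"] = "saw update" if shared["ready"] else "time limit exceeded"
    yield

def event_waiter(shared):
    while not shared["ready"]:
        yield                      # explicit synchronization point
    shared["outcome"] = "saw update"

spin = {"ready": False, "outcome": None}
run([spinning_waiter(spin, 100000), updater(spin)])

sync = {"ready": False, "outcome": None}
run([event_waiter(sync), updater(sync)])
```

The spinning task exhausts its limit because the updating task never gets a
chance to run, while the task that waits at a synchronization point sees the
update on its next turn.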

The new COS system calls to create and delete tasks are available for 
CRAY-1 systems in the COS 1.13 release, but the lack of an intertask 
synchronization mechanism (as provided by the CRAY X-MP hardware) makes 
these generally useless. 


B. MESSAGES


The following messages may be encountered during the development and 
testing of a multitasked application. For information on other messages, 
see the CRAY-OS Message Manual, publication SR-0039. 


AB199 - MAXIMUM USER TASKS PER JOB EXCEEDED 

The user has created more than I0MAXNUT (a COS installation parameter) 
active logical CPUs. This occurs only if TSKTUNE was called with MAXCPU 
set above I0MAXNUT. 

UT013 - FATAL STACK OVERFLOW 

Insufficient space is available for expansion of the stack. Insufficient
space was allocated for the stack, and the increment on the LDR STK or the
SEGLDR STACK directive is zero.

UT015 - EVREL CALLED WITH TASKS WAITING FOR EVENT 

An event variable was released with EVREL but was being waited for by 
some task. 

UT016 - LOCKREL CALLED WITH LOCK SET 

A lock variable was released with LOCKREL but was currently in use by 
some task. 

UT017 - INVALID LOCK IDENTIFIER 

LOCKREL was called but the specified lock variable appears to be 
invalid. Check that the lock variable was assigned and that it was not 
accidentally overwritten. 

UT019 - HEAP IS FULL, CAN’T SATISFY REQUEST 

Insufficient space is available for expansion of the heap. Either 
insufficient memory space remains in the job's field length, or the 
increment on the LDR MM or SEGLDR HEAP directive is zero. 

UT024 - DEADLOCK - ALL USER TASKS WAITING FOR LOCKS OR EVENTS 

The library detected a situation in which all active tasks are suspended
for events or locks.

UT025 - UNRECOGNIZED SCHEDULER PARAMETER NAME 

An ASCII string passed as a TSKTUNE parameter was not recognized. 


C. APPROXIMATE TIMINGS




This appendix contains approximate timings for the multitasking library 
subroutines. These timings are subject to change and are provided only 
for planning purposes. The timings are grouped by the subroutine 
groupings used in section 3. 

Parallelism subroutines:

Subroutine and beginning conditions                    Clock periods

TSKSTART   (first call in program)                     1,500,000†
           (later call, if logical CPU†† needed)          40,000
           (later call, if logical CPU not needed)         2,500
TSKWAIT    (task completed execution)                      200+†††
           (task exists)                                 1,500+§
TSKVALUE                                                   200


Protection subroutines:

Subroutine and beginning conditions                    Clock periods

LOCKASGN                                                   200
LOCKON     (lock free)                                     200
           (lock locked)                                 1,500+§§
LOCKOFF    (no tasks waiting)                              200
           (tasks waiting)                               1,500
LOCKREL                                                    200


†    The value of 1,500,000 for TSKSTART is a worst case and may occur
     when a memory expansion must obtain stack space. Parameters on the
     LDR card can bring about this memory allocation at load time rather
     than run time (see section 3). Experience has shown that for most
     multitasking applications, the initial one or two TSKSTART calls are
     subject to the larger times shown above, while all subsequent calls
     take only about 2,500 clock periods. Section 3 describes the events
     that could lead to larger times, but these are typically uncommon.

††   Logical CPUs are discussed in section 3.

†††  Plus approximately 25 clock periods for each existing task.

§    Plus time spent waiting for the task to complete execution.

§§   Plus time spent waiting for the lock to be unlocked.

Synchronization subroutines:

Subroutine and beginning conditions                    Clock periods

EVASGN                                                     200
EVWAIT     (event posted)                                  200
           (event clear)                                 1,500+†
EVPOST     (no tasks waiting)                              200
           (tasks waiting)                               1,500
EVCLEAR                                                    200
EVREL                                                      200


†    Plus time spent waiting for the event to be posted.
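
For planning purposes, timings of this kind can be turned into a rough
estimate of the overhead of starting and waiting for one task. The Python
sketch below is only an illustration of that arithmetic. It uses the
approximate figures of 2,500 clock periods for a later TSKSTART call and
1,500 for a TSKWAIT on an existing task; the function name and the sample
task sizes are hypothetical, and time spent waiting is ignored.

```python
# Approximate clock periods quoted in this appendix.
TSKSTART_LATER_CALL = 2_500     # later call, logical CPU not needed
TSKWAIT_TASK_EXISTS = 1_500     # excludes time spent waiting for the task

def overhead_fraction(work_clock_periods):
    """Fraction of total time spent in TSKSTART and TSKWAIT for one task."""
    overhead = TSKSTART_LATER_CALL + TSKWAIT_TASK_EXISTS
    return overhead / (work_clock_periods + overhead)

small = overhead_fraction(4_000)        # task as long as its own overhead
large = overhead_fraction(396_000)      # task about 100 times longer
```

Under these assumptions a task of 4,000 clock periods spends half of its
total time in overhead, while a task of 396,000 clock periods spends about
one percent, which is why larger task granularity is preferred.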





D 


Three function subprograms, one each for tasks, locks, and events, can be
used to obtain the status of an entity. These subprograms are not
described in section 5 because they are of limited use and
carry a high risk. The risk is that a task using one of these
subprograms may unintentionally enter a busy wait or a spin lock
condition, locking all other tasks out of execution. For this reason,
using these features in general code is discouraged.


D.1 TSKTEST

TSKTEST returns a value indicating whether or not the indicated task
exists. 


Format: 


return = TSKTEST(taskarray)


return       A logical .TRUE. if the indicated task exists; a logical
             .FALSE. if the task was never created or has completed
             execution

taskarray Task control array 


NOTE 

TSKTEST must be declared LOGICAL in the calling module.


D.2 LOCKTEST


LOCKTEST tests if a lock is in the locked state. LOCKTEST acts the same 
as LOCKON, except the task never waits. A task using LOCKTEST must 
always look at the return value before continuing. 


Format:


return = LOCKTEST(name)


return       A logical .TRUE. if the lock was originally in the locked
             state; a logical .FALSE. if the lock was originally in the
             unlocked state. The lock variable's state is always set to
             locked upon return.

name Name of an integer variable used as a lock 


NOTE 

LOCKTEST must be declared LOGICAL in the calling module. 
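
The return convention of LOCKTEST resembles a non-blocking lock attempt in
other systems. The sketch below is a Python analogy, not the Cray routine:
the function locktest is hypothetical and is built on the standard
threading.Lock, whose acquire(blocking=False) returns True when the lock was
free and has now been taken.

```python
import threading

def locktest(lock):
    """Return True if the lock was already locked, False if it was free.
    Either way the lock is in the locked state when the call returns."""
    return not lock.acquire(blocking=False)

lock = threading.Lock()
first = locktest(lock)     # lock was free, so it is now held by this caller
second = locktest(lock)    # lock was already locked; nothing was obtained
if not first:
    lock.release()         # release only because the first call obtained it
```

A caller that receives a true value has not obtained the lock and must not
enter the critical region. Retrying in a tight loop is the busy wait that
this appendix warns against.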


D.3 EVTEST


EVTEST tests if an event is posted.


Format:


return = EVTEST(name)


return       A logical .TRUE. if the event is posted; a logical .FALSE. if
             the event has never been posted or is cleared. The event
             variable's state is unaffected by a call to EVTEST.

name         Name of an integer variable used as an event


NOTE


EVTEST must be declared LOGICAL in the calling module. 





GLOSSARY 


Assign - Identifying a variable that the program intends to use as a lock 
or event. Locks and events must be assigned before they can be used. 

Clear - (1) An event state indicating that no signal is outstanding. (2) 
An operation causing the event state to change to clear. 

COBEGIN - A sequence of independent program segments 

Common - Accessible to multiple parts of a program. Common is a type of
scope.

Computational dependence - A form of dependence resulting from either
control dependence, data dependence, or both

Control dependence - A form of dependence that occurs when the order of 
execution depends upon preceding segments of code 

Critical region - A segment of sequential code that accesses a shared 
resource 

Data dependence - A form of dependence that occurs when the data 
resulting from one segment of code depends upon the data resulting from 
preceding segments of code 

Deadlock - A condition in which locks and synchronization mechanisms have 
been misused to the point where a task is waiting for something to happen 
that will never happen 

Deadlock detection - Recognizing a deadlock situation after the deadlock 
has occurred 

Deadlock prevention - Use of procedures or rules to ensure that deadlock 
does not occur 

Deadly embrace - A form of deadlock 

Dependence graph - A pictorial representation of the relationships 
between segments of code in a program 

DOALL - A loop with independent iterations

DOPIPE - A software pipeline of program segments within a loop. Data
dependencies prevent the loop from having independent iterations.

Event - (1) Facility to allow signaling between tasks. Events have two
states: cleared and posted. (2) Variable used to represent an event. 

Heap - Memory space between the job and the I/O buffers in user space 

Library scheduler - A library subroutine that assumes primary responsibility 
for managing and scheduling the tasks within a program 

Library task - A task created by a user job using the library calls described
in this publication. A library task is referred to simply as a task in this
publication.

Load balancing - A process used to ensure that the amount of work done by each
of the processors involved in a job is approximately equal

Local - Accessible only to a particular part of a program (usually a single
module). Local is a type of scope.

Lock - (1) Facility to monitor critical regions of code. Locks have two
states: locked and unlocked. (2) Variable used to represent a lock.

Logical CPU - Entity scheduled by COS for execution on physical CPUs; a user
task

Monitor - Controlling access to critical regions 

Multiprogramming - A mode of operation that provides for the sharing of 
processor resources among multiple independent processes 

Multiprocessing - A mode of operation that provides for parallel processing by 
two or more processors 

Multitasking - A special case of multiprocessing in which a task is a subjob 
or a subprogram 

Mutual exclusion - A property of a critical region in which at most one task 
can execute it at a time 

Nondeterministic - Not able to determine from the start. Multitasking is 
nondeterministic with respect to time; the order of execution of parallel 
tasks cannot be determined from run to run. 

Nonreentrancy - A property of a program module that allows it to be used only 
once. Such a module is called nonreentrant. 

Parallelism - Simultaneous processing of jobs, parts of jobs, programs, or 
parts of programs. The order of execution for code segments that execute in 
parallel typically cannot be determined ahead of time. 

Post - An operation causing the event state to change to posted 

Posted - An event state indicating that a signal is outstanding

Reentrancy - A property of a program module that allows one copy of it to be
used by more than one job or task in parallel. Such a module is called
reentrant.

Release - Indicating that a variable is no longer intended for use as a lock
or event

Scope - Region of a program in which a variable is defined and can be 
referenced. See local, common, and shared. 

Scope boundaries - Beginning and end of the region of a program that is the
scope of a variable

Serial reusability - A property of a program module that allows it to be used
multiple times but by no more than one task at a time

Shared - Referenced by multiple parts of a program. Shared is a type of scope.

Stack - A data structure providing a dynamic, sequential data list, having
special provisions for access from one end or the other. A last in, first out
(push down, pop up) stack is accessed from just one end.

Stackframe - Element of a stack. A stackframe is allocated when a reentrant
subroutine is entered and deallocated on exit.

Starvation - A characteristic of a multitasking program in which one or more
tasks get no (or virtually no) execution time on a physical CPU

Storage dependence - A form of dependence that occurs when tasks share
variables

Synchronization - The process of coordinating the steps within processes that
can be run in parallel

Synchronization point - A point in time at which a task has received the
go-ahead to proceed with its processing

System task - One of the modules that comprises the Cray Operating System
(COS); for example, Disk Queue Manager (DQM) or Station Call Processor (SCP)

Task - A software process. A task is a unit of computation that can be
scheduled and whose instructions must be processed in sequential order. At
Cray Research, task means a subprogram.

Task common - Data that must be common to all subroutines that are executed by
a single task but should be local to that task

Task control array - A data structure used to represent a user-created task

Task granularity - Approximate execution time of a task, usually given as an
order of magnitude 

Task information block - An area in the base of the task's stack that contains 
internal information about the task 

Task value - An optional word within a task control array that may be set to 
any value by the user before creating the task 

User task - Entity scheduled for execution by COS. A user task is referred to 
as a logical CPU in this publication. 